Introduction to Open Data Science - Course Project

About the project

Ruminations about the Introduction to Open Data Science (IODS) 2021

Feelings

The course Introduction to Open Data Science (IODS) feels like a good introduction to the R-RStudio-Git pipeline, advocating reproducible, well-documented research.

Challenges may arise due to

  • various backgrounds of the students
  • technological hurdles

Outcomes

I hope the course will generate mindsets favoring open research equipped with technical skills to produce good research output.

I hope to learn not only technical skills but also ways to endorse open data science.

Where to find the course

I found the course in Sisu. Apparently it is also available on the MOOC platform.

Repo

A fork of the course can be found at my repo.


Week 2 - Simple linear regression

date()
## [1] "Sat Dec 11 23:10:36 2021"

Introduction

This week, we are analyzing and modeling survey data concerning approaches to learning. A description of the data can be found here and here.

The data has been preprocessed to include variables on gender, age, attitude, exam points, and scores on deep, surface and strategic questions.

Dataset

Begin by loading the analysis data. Let’s use the “official” CSV file; an R script to produce the wrangled data is in the repo.

# read in the data
learning2014 <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header = TRUE, sep = ",")

dim(learning2014)
## [1] 166   7

There are 166 rows and 7 columns.

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The variables are mostly numeric. Gender is stored as character; convert it to a factor.

# change to factor
learning2014$gender <- as.factor(learning2014$gender)

summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

It seems that most of the students are female, ages span 17-55, exam points range from 7 to 33, and the question variables are capped at 5 (per the documentation, they are measured on a scale of 1-5).

Graphical overview

Using GGally::ggpairs, we can plot the distributions of the variables, correlations (with significance), boxplots and scatterplots, stratified by gender.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), 
             lower = list(combo = wrap("facethist", bins = 20)),  
             upper = list(continuous = wrap("cor", size = 3)))

p

We see that

  • Most students are female
  • Age is right-skewed (most students are young)
  • There is a positive correlation between points and attitude
  • There is, as expected, a negative correlation between surf and deep questions
  • There is a negative correlation between surf and attitude
  • There is a negative correlation between surf and stra questions
  • Distribution of the exam points is not perfectly normal (can be tested with e.g. stats::shapiro.test)
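As noted in the last bullet, normality can be checked formally. Below is a minimal sketch of stats::shapiro.test on simulated, deliberately skewed data; in the actual analysis, `x` would be replaced with learning2014$points.

```r
set.seed(1)

# Shapiro-Wilk normality test; a small p-value suggests a deviation from
# normality. Replace `x` with learning2014$points to test the exam points.
x <- rexp(100)            # clearly non-normal stand-in data
shapiro.test(x)$p.value   # small here, so normality is rejected
```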

Linear regression

For pedagogical reasons, let’s first fit a multivariable linear regression with four explanatory variables (instead of three; choosing three would not change the conclusions), i.e. regress exam points on the other variables. The variables are chosen based on their absolute correlation with the exam points.

sort(abs(cor(learning2014[, -1])[, c("points")]))
##       deep        age       surf       stra   attitude     points 
## 0.01014849 0.09319032 0.14435642 0.14612247 0.43652453 1.00000000
model <- lm(points ~ attitude + stra + surf + age, data = learning2014)

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf + age, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8639  -3.2849   0.3086   3.4548  10.7028 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.70644    3.96234   3.459 0.000694 ***
## attitude     3.38964    0.57041   5.942 1.68e-08 ***
## stra         0.93123    0.53986   1.725 0.086459 .  
## surf        -0.76565    0.80258  -0.954 0.341524    
## age         -0.09466    0.05346  -1.771 0.078502 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.262 on 161 degrees of freedom
## Multiple R-squared:  0.2226, Adjusted R-squared:  0.2033 
## F-statistic: 11.52 on 4 and 161 DF,  p-value: 2.987e-08

From the summary, the intercept is significantly non-zero based on a t-test (H_0: the intercept is zero, H_1: it is not), although the result is not very meaningful here, as e.g. age cannot be zero in this context.

The attitude variable has a highly significant effect on the exam points (the p-value is about zero for the t-test with H_0: the coefficient of attitude is zero when the other variables are held constant, H_1: it is not). Variables stra and age are weakly significant or non-significant depending on the nomenclature (same test). Variable surf is not significant, so let’s remove it and refit.

model <- lm(points ~ attitude + stra + age, data = learning2014)

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

Now all explanatory variables are significant with a 0.10 significance cutoff.

The F-test is significant; therefore at least one of the explanatory variables is associated with the response variable (has a non-zero coefficient).
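The comparison behind the overall F-test (full model versus intercept-only model) can be made explicit with anova(). A minimal sketch on simulated data; in this analysis, the corresponding call would be anova(lm(points ~ 1, data = learning2014), model).

```r
set.seed(2)

# Simulated stand-in data: y depends on x1 but not x2
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
d$y <- 1 + 2 * d$x1 + rnorm(80)
fit <- lm(y ~ x1 + x2, data = d)

# Comparing against the intercept-only model reproduces the F-statistic
# and p-value reported by summary(fit)
anova(lm(y ~ 1, data = d), fit)
```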

Interpretation of the linear model

From the summary above, we see that as attitude increases by one unit, exam points increase by about 3.48 when the other variables are held constant. For stra, the corresponding value is about 1.00. Age appears to have a smaller effect, with an additional year decreasing exam points by around 0.09 if the other variables do not change.

Multiple R-squared is 0.2182, meaning that the three explanatory variables account for approximately 22% of the variation in the exam points. This is quite low, so we should be careful when e.g. predicting exam points for new students with this model. The adjusted R-squared accounts for the number of explanatory variables and is here slightly smaller than the unadjusted value.

Diagnostics

Linear models carry assumptions that need to be verified. We use diagnostic plots for visual assessment.

Residuals versus fitted

# residuals vs fitted values
plot(model, which = 1)

There doesn’t seem to be a clear pattern in the residuals versus fitted values: the variance of the residuals is similar across the fitted values (visually there is perhaps a slight funnel of decreasing variance, but the Breusch-Pagan test car::ncvTest(model) does not detect heteroskedasticity). The variation is also approximately symmetric around zero. The constant-variance assumption therefore appears to be met, and a linear model seems suitable for the data.
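As a sketch of the idea behind such a test, the (studentized) Breusch-Pagan statistic can be computed in base R by regressing the squared residuals on the regressors; note that car::ncvTest implements a related but not identical score test. Shown here on simulated homoskedastic data:

```r
set.seed(42)

# Simulated data with constant error variance
n <- 200
d <- data.frame(x = runif(n))
d$y <- 2 + 3 * d$x + rnorm(n)
fit <- lm(y ~ x, data = d)

# Breusch-Pagan by hand: regress squared residuals on the regressors.
# Under H0 (constant variance), n * R^2 is approximately chi-squared
# distributed (1 df here, one regressor).
aux <- lm(resid(fit)^2 ~ d$x)
bp.stat <- n * summary(aux)$r.squared
pchisq(bp.stat, df = 1, lower.tail = FALSE)  # large: no heteroskedasticity
```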

Three outliers (set by id.n in plot.lm) are flagged (rows 35, 56 and 145); let’s check them.

print(learning2014[c(35, 56, 145), ])
##     gender age attitude     deep  stra     surf points
## 35       M  20      4.2 4.500000 3.250 1.583333     10
## 56       F  45      3.8 3.000000 3.125 3.250000      9
## 145      F  21      3.5 3.916667 3.875 3.500000      7
annotate <- c(35, 56, 145, 4, 2)  # 4 and 2 for leverage below

# make the plot tighter
op <- par(mfrow=c(3, 1),
          mar = c(4, 4, 1, 1),
          mgp = c(2, 1, 0))

parameters <- c("attitude", "stra", "age")

for (param in parameters) {
  print(param)
  plot(learning2014[, param], learning2014$points, 
       xlab = param, ylab = "points")
  
  text(learning2014[annotate, param], 
       learning2014[annotate, "points"], 
       annotate, pos = 4, col = "red")
}
## [1] "attitude"
## [1] "stra"
## [1] "age"

par(op) # return original par

These are the students who got low exam points despite having high attitude values.

Plot a boxplot to assess the symmetry of the residual distribution.

boxplot(resid(model))

It does not look too bad, but it is not perfectly symmetric. A QQ plot will help assess normality.

QQ-plot

# qq-plot
plot(model, which = 2)

Standardized residuals follow the linear pattern quite reasonably with the same outlier exceptions as above. Hence, there is no strong reason to suspect deviation from normality in the distribution of the residuals.

Residuals versus leverage

# residuals vs. leverage
plot(model, which = 5)

Observations 2, 4 and 56 are flagged as outliers; they have relatively high influence on the regression line compared to the other observations. We can refit the model without them to see whether the multiple R-squared increases.

Cook’s distance is also low, supporting the notion that no single observation has a large effect on the linear fit.

max(cooks.distance(model))
## [1] 0.1202162
summary(lm(points ~ attitude + stra + age, data = learning2014[-c(2, 4, 56), ]))
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014[-c(2, 
##     4, 56), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.822  -3.520   0.348   3.617  10.919 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.38472    2.58566   3.243  0.00144 ** 
## attitude     3.64594    0.53695   6.790 2.11e-10 ***
## stra         0.84219    0.51066   1.649  0.10108    
## age          0.01968    0.05677   0.347  0.72930    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.008 on 159 degrees of freedom
## Multiple R-squared:  0.2419, Adjusted R-squared:  0.2276 
## F-statistic: 16.91 on 3 and 159 DF,  p-value: 1.394e-09

Multiple R-squared is indeed a bit better. However, removing the outliers would have to be justified by the data (e.g., are these students somehow different?).

VIF

library(car)
## Loading required package: carData
vif(model)
## attitude     stra      age 
## 1.004083 1.014233 1.010865

The variance inflation factors are well below 10 (a common textbook cutoff), so collinearity is not a concern.
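For intuition, the VIF of a predictor is 1/(1 - R^2_j), where R^2_j comes from regressing that predictor on the remaining ones; equivalently, it is the corresponding diagonal element of the inverse correlation matrix of the predictors. A minimal sketch on simulated data; in this analysis, attitude would be regressed on stra and age.

```r
set.seed(7)

# Three simulated predictors (no response needed: VIF is a property
# of the predictor set alone)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))

# VIF by definition: regress one predictor on the others
r2 <- summary(lm(x1 ~ x2 + x3, data = d))$r.squared
manual.vif <- 1 / (1 - r2)

# Equivalent identity: diagonal of the inverse correlation matrix
diag(solve(cor(d)))["x1"]  # equals manual.vif
```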

AIC

We can try automatic stepwise model selection by the AIC criterion.
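As a reminder, AIC = 2k - 2 log L, where k is the number of estimated parameters (for lm, the coefficients plus the residual variance). A minimal check of this identity on simulated data; note that stepAIC reports AIC via extractAIC, which drops additive constants, so its values differ from stats::AIC by a constant.

```r
set.seed(3)

# Simulated data and a simple linear model
d <- data.frame(x = rnorm(50))
d$y <- d$x + rnorm(50)
fit <- lm(y ~ x, data = d)

# k = number of estimated parameters: intercept, slope and residual variance
k <- attr(logLik(fit), "df")
manual.aic <- 2 * k - 2 * as.numeric(logLik(fit))

AIC(fit)  # equals manual.aic
```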

library(MASS)
model.full <- lm(points ~ gender + age + attitude + deep + stra + surf, 
                 data = learning2014)
stepAIC(model.full, direction = "both")
## Start:  AIC=558.36
## points ~ gender + age + attitude + deep + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - gender    1      0.11 4408.5 556.36
## - surf      1     47.79 4456.1 558.15
## - deep      1     49.00 4457.3 558.19
## <none>                  4408.3 558.36
## - stra      1     83.78 4492.1 559.48
## - age       1     87.88 4496.2 559.63
## - attitude  1    919.82 5328.2 587.82
## 
## Step:  AIC=556.36
## points ~ age + attitude + deep + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - surf      1     47.69 4456.1 556.15
## - deep      1     49.11 4457.6 556.20
## <none>                  4408.5 556.36
## - stra      1     88.35 4496.8 557.66
## - age       1     90.16 4498.6 557.72
## + gender    1      0.11 4408.3 558.36
## - attitude  1    999.18 5407.6 588.27
## 
## Step:  AIC=556.15
## points ~ age + attitude + deep + stra
## 
##            Df Sum of Sq    RSS    AIC
## - deep      1     26.61 4482.8 555.14
## <none>                  4456.1 556.15
## + surf      1     47.69 4408.5 556.36
## - age       1     75.36 4531.5 556.93
## - stra      1    106.07 4562.2 558.05
## + gender    1      0.02 4456.1 558.15
## - attitude  1   1084.38 5540.5 590.30
## 
## Step:  AIC=555.14
## points ~ age + attitude + stra
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  4482.8 555.14
## - age       1     76.62 4559.4 555.95
## + deep      1     26.61 4456.1 556.15
## + surf      1     25.20 4457.6 556.20
## - stra      1     97.64 4580.4 556.71
## + gender    1      0.01 4482.8 557.14
## - attitude  1   1060.72 5543.5 588.39
## 
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
## 
## Coefficients:
## (Intercept)          age     attitude         stra  
##    10.89543     -0.08822      3.48077      1.00371

The resulting model is the same one we derived earlier.

Brute force

Finally, we can brute-force all linear models built from simple combinations of the explanatory variables.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
## 
##     cement
## The following object is masked from 'package:datasets':
## 
##     rivers
res <- olsrr::ols_step_all_possible(model.full)

res[order(res$adjr, decreasing = TRUE), ]
##    Index N                         Predictors     R-Square Adj. R-Square
## 62    57 5        age attitude deep stra surf 0.2311318189   0.207104688
## 33    22 3                  age attitude stra 0.2181719645   0.203693668
## 52    42 4             age attitude deep stra 0.2228135019   0.203504521
## 54    43 4             age attitude stra surf 0.2225665343   0.203251417
## 63    63 6 gender age attitude deep stra surf 0.2311510113   0.202137842
## 43    44 4           gender age attitude stra 0.2181729919   0.198748718
## 57    58 5      gender age attitude deep stra 0.2228168858   0.198529914
## 59    59 5      gender age attitude stra surf 0.2226046106   0.198311005
## 53    45 4             age attitude deep surf 0.2157221484   0.196236984
## 56    46 4            attitude deep stra surf 0.2154077956   0.195914822
## 17     7 2                      attitude stra 0.2048096617   0.195052725
## 38    23 3                 attitude deep stra 0.2096692889   0.195033535
## 34    24 3                  age attitude surf 0.2081990315   0.193536051
## 40    25 3                 attitude stra surf 0.2074263400   0.192749050
## 58    60 5      gender age attitude deep surf 0.2165385715   0.192055402
## 12     8 2                       age attitude 0.2011434849   0.191341564
## 61    61 5     gender attitude deep stra surf 0.2158241415   0.191318646
## 27    26 3               gender attitude stra 0.2050965812   0.190376147
## 48    47 4          gender attitude deep stra 0.2098633360   0.190232611
## 32    27 3                  age attitude deep 0.2043132366   0.189578297
## 44    48 4           gender age attitude surf 0.2090640320   0.189413449
## 50    49 4          gender attitude stra surf 0.2079025033   0.188223062
## 39    28 3                 attitude deep surf 0.2023788945   0.187608133
## 22    29 3                gender age attitude 0.2017804149   0.186998571
## 3      1 1                           attitude 0.1905536672   0.185618019
## 18     9 2                      attitude surf 0.1952865878   0.185412804
## 42    50 4           gender age attitude deep 0.2048827672   0.185128302
## 49    51 4          gender attitude deep surf 0.2040700385   0.184295381
## 16    10 2                      attitude deep 0.1939911024   0.184101423
## 28    30 3               gender attitude surf 0.1970278452   0.182157990
## 8     11 2                    gender attitude 0.1919357660   0.182020867
## 26    31 3               gender attitude deep 0.1952588802   0.180356267
## 47    52 4               gender age stra surf 0.0653293394   0.042107708
## 60    62 5          gender age deep stra surf 0.0707276010   0.041687839
## 37    32 3                      age stra surf 0.0520482669   0.034493605
## 55    53 4                 age deep stra surf 0.0568671481   0.033435276
## 24    33 3                    gender age stra 0.0504456777   0.032861338
## 31    34 3                   gender stra surf 0.0462022812   0.028539360
## 45    54 4               gender age deep stra 0.0514840047   0.027918390
## 51    55 4              gender deep stra surf 0.0510011757   0.027423565
## 21    12 2                          stra surf 0.0363412029   0.024517169
## 25    35 3                    gender age surf 0.0420448279   0.024304917
## 41    36 3                     deep stra surf 0.0407134651   0.022948900
## 10    13 2                        gender stra 0.0346692266   0.022824677
## 46    56 4               gender age deep surf 0.0462679619   0.022572756
## 15    14 2                           age surf 0.0340077514   0.022155086
## 14    15 2                           age stra 0.0331743748   0.021311484
## 36    37 3                      age deep surf 0.0379369858   0.020121004
## 29    38 3                   gender deep stra 0.0357520229   0.017895579
## 35    39 3                      age deep stra 0.0336895603   0.015794923
## 5      2 1                               stra 0.0213517768   0.015384410
## 6      3 1                               surf 0.0208387755   0.014868280
## 11    16 2                        gender surf 0.0267878521   0.014846599
## 30    40 3                   gender deep surf 0.0306219089   0.012670463
## 20    17 2                          deep surf 0.0244545065   0.012484623
## 19    18 2                          deep stra 0.0219453518   0.009944681
## 7     19 2                         gender age 0.0196556536   0.007626889
## 2      4 1                                age 0.0086844365   0.002639829
## 1      5 1                             gender 0.0086318623   0.002586935
## 23    41 3                    gender age deep 0.0198419947   0.001690921
## 9     20 2                        gender deep 0.0088743608  -0.003286690
## 13    21 2                           age deep 0.0087454938  -0.003417138
## 4      6 1                               deep 0.0001029919  -0.005993941
##    Mallow's Cp
## 62    5.003969
## 33    3.684101
## 52    4.724219
## 54    4.775293
## 63    7.000000
## 43    5.683889
## 57    6.723519
## 59    6.767418
## 53    6.190730
## 56    6.255739
## 17    4.447461
## 38    5.442477
## 34    5.746530
## 40    5.906325
## 58    8.021891
## 12    5.205636
## 61    8.169637
## 27    6.388125
## 48    7.402347
## 32    6.550123
## 44    7.567646
## 50    7.807853
## 39    6.950150
## 22    7.073917
## 3     5.395638
## 18    6.416857
## 42    8.432342
## 49    8.600417
## 16    6.684767
## 28    8.056761
## 8     7.109816
## 26    8.422587
## 47   37.292359
## 60   38.175985
## 37   38.038920
## 55   39.042363
## 24   38.370340
## 31   39.247886
## 45   40.155611
## 51   40.255461
## 21   39.287183
## 25   40.107658
## 41   40.382987
## 10   39.632952
## 46   41.234303
## 15   39.769746
## 14   39.942091
## 36   40.957170
## 29   41.409026
## 35   41.835549
## 5    40.387035
## 6    40.493125
## 11   41.262841
## 30   42.469948
## 20   41.745383
## 19   42.264283
## 7    42.737798
## 2    43.006675
## 1    43.017547
## 23   44.699262
## 9    44.967398
## 13   44.994048
## 4    44.781340

By adjusted R-squared, the three-variable model we used before fares very well compared to the full model. Attitude alone is also quite good in comparison and could be chosen for parsimony.


Week 3 - Logistic regression

date()
## [1] "Sat Dec 11 23:10:47 2021"

Introduction

This week, we are analyzing and modeling data from two questionnaires related to student performance and alcohol consumption. Further description is available here. Since the description is relevant for the analysis, we reprint it verbatim:

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

The dataset has been preprocessed by joining data from two courses, Math and Portuguese language. The variables alc_use and high_use were added in the wrangling phase.

Dataset

Begin by loading the analysis data. Let’s use the “official” CSV file; an R script to produce the wrangled data is in the repo.

rm(list=ls())

# read in the data, convert strings to factors
alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv",
                header = TRUE, sep = ",", stringsAsFactors = TRUE)

dim(alc)
## [1] 370  51

There are 370 rows and 51 columns.

str(alc)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...

It seems the variables are mostly integer-valued or factors (stored internally as integers), with at least one logical (high_use) and one double (alc_use).

# (Type for factor is int.)
table(sapply(alc, typeof))
## 
##  double integer logical 
##       1      49       1
# Check for zero variance (negate characters to get "not all duplicated")
sort(
  sapply(alc, function(x) { 
    ifelse(is.numeric(x), var(x), !all(duplicated(x)[-1L])) 
    }))
##            n   failures.p     failures   traveltime   failures.m    studytime 
## 0.000000e+00 2.398667e-01 3.109939e-01 4.916502e-01 5.049513e-01 7.189922e-01 
##         Dalc       famrel     freetime      alc_use       school          sex 
## 8.032594e-01 8.304695e-01 9.712224e-01 9.958178e-01 1.000000e+00 1.000000e+00 
##      address      famsize      Pstatus         Mjob         Fjob       reason 
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
##     guardian    schoolsup       famsup   activities      nursery       higher 
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
##     internet     romantic         paid       paid.p       paid.m     high_use 
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 
##         Medu         Fedu        goout          age         Walc       health 
## 1.173984e+00 1.179697e+00 1.273720e+00 1.393987e+00 1.666366e+00 1.981220e+00 
##         G2.p         G1.p           G1           G2         G3.p           G3 
## 6.078518e+00 6.512854e+00 7.301699e+00 7.914627e+00 8.665092e+00 1.080868e+01 
##         G1.m         G2.m         G3.m   absences.p     absences   absences.m 
## 1.119153e+01 1.444749e+01 2.124131e+01 2.330626e+01 3.029392e+01 5.876224e+01 
##          cid         id.m         id.p 
## 1.143917e+04 1.274368e+04 3.041907e+04
# Seems OK, only `n` has zero variance.

Variables correlated to alcohol consumption

Let’s apply a data-driven approach to discover the top four variables linked to alcohol consumption. Intuitively, family relationships and characteristics, study time, grades, and peer behavior should have an effect.

We want to compute the associations between all the variables. To handle factors as well, we adapt a Stack Overflow solution.

# From 
# https://stackoverflow.com/questions/52554336/plot-the-equivalent-of-correlation-matrix-for-factors-categorical-data-and-mi/56485520#56485520

library(tidyverse)
library(rcompanion)


# Calculate a pairwise association between all variables in a data-frame. In particular nominal vs nominal with Chi-square, numeric vs numeric with Pearson correlation, and nominal vs numeric with ANOVA.
# Adopted from https://stackoverflow.com/a/52557631/590437
mixed_assoc = function(df, cor_method="spearman", adjust_cramersv_bias=TRUE){
    df_comb = expand.grid(names(df), names(df),  stringsAsFactors = F) %>% set_names("X1", "X2")

    is_nominal = function(x) class(x) %in% c("factor", "character")
    # https://community.rstudio.com/t/why-is-purr-is-numeric-deprecated/3559
    # https://github.com/r-lib/rlang/issues/781
    is_numeric <- function(x) { is.integer(x) || is_double(x)}

    f = function(xName,yName) {
        x =  pull(df, xName)
        y =  pull(df, yName)

        result = if(is_nominal(x) && is_nominal(y)){
            # use bias corrected cramersV as described in https://rdrr.io/cran/rcompanion/man/cramerV.html
            cv = cramerV(as.character(x), as.character(y), bias.correct = adjust_cramersv_bias)
            data.frame(xName, yName, assoc=cv, type="cramersV")

        }else if(is_numeric(x) && is_numeric(y)){
            correlation = cor(x, y, method=cor_method, use="complete.obs")
            data.frame(xName, yName, assoc=correlation, type="correlation")

        }else if(is_numeric(x) && is_nominal(y)){
            # from https://stats.stackexchange.com/questions/119835/correlation-between-a-nominal-iv-and-a-continuous-dv-variable/124618#124618
            r_squared = summary(lm(x ~ y))$r.squared
            data.frame(xName, yName, assoc=sqrt(r_squared), type="anova")

        }else if(is_nominal(x) && is_numeric(y)){
            r_squared = summary(lm(y ~x))$r.squared
            data.frame(xName, yName, assoc=sqrt(r_squared), type="anova")

        }else {
            warning(paste("unmatched column type combination: ", class(x), class(y)))
        }

        # finally add complete obs number and ratio to table
        result %>% mutate(complete_obs_pairs=sum(!is.na(x) & !is.na(y)), complete_obs_ratio=complete_obs_pairs/length(x)) %>% rename(x=xName, y=yName)
    }

    # apply function to each variable combination
    map2_df(df_comb$X1, df_comb$X2, f)
}

Then we can use the function for correlations

# remove boolean high_use (or recode it to integer/factor)
mixed.cors <- mixed_assoc(alc[, -which(names(alc) %in% c("high_use"))])
library(corrplot)

cor.matrix <- 
  mixed.cors %>% 
  select(x, y, assoc) %>% 
  spread(y, assoc) %>% 
  select(-x) %>%
  replace(is.na(.), 0)

rownames(cor.matrix) <- colnames(cor.matrix)

corrplot(as.matrix(cor.matrix), method = "color",
         tl.cex = 0.5, tl.col = "black")

From the correlation plot we see that, as expected by definition, Dalc and Walc correlate positively with alc_use. It would be interesting to include them in further analysis to explain drinking behavior, but since we are limited to four variables, we take the coarse approach and keep only alc_use.

The grade variables correlate negatively with alc_use; based on the data description above, it is easiest to choose G3 for further analysis. famrel correlates negatively with alc_use and seems quite interesting; goout, sex and studytime are interesting as well. Let’s make sure these variables do not correlate with each other too strongly.

interesting_vars <- c("alc_use", "G3", "famrel", "goout", "sex", "studytime")

cor.matrix <- 
  alc %>%
    select(interesting_vars) %>%
    mixed_assoc(cor_method = "pearson") %>%
    select(x, y, assoc) %>% 
    spread(y, assoc) %>%
    select(-x)

rownames(cor.matrix) <- colnames(cor.matrix)

corrplot(as.matrix(cor.matrix), method = "color",
         tl.cex = 1, tl.col = "black", addCoef.col = 'black')

Seems good!

# Select the subset
alc.sub <- alc[, interesting_vars]

str(alc.sub)
## 'data.frame':    370 obs. of  6 variables:
##  $ alc_use  : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ G3       : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ famrel   : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ goout    : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ sex      : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ studytime: int  4 2 1 3 3 3 3 2 4 4 ...

Further exploration

Begin with GGally::ggpairs:

library(GGally)
library(ggplot2)

p <- ggpairs(alc.sub, mapping = aes(col = sex, alpha = 0.3), 
             lower = list(combo = wrap("facethist", bins = 20)),  
             upper = list(continuous = wrap("cor", size = 3)))

p

We see that

  • There are approximately equal numbers of males (n=175) and females (n=195)
  • Males tend to use more alcohol
  • The distribution of grade G3 is similar for both sexes; it is not normal (by shapiro.test()), having a single peak and a left skew
  • famrel correlates with alc_use, but significantly only for males
  • Males report better-quality family relationships
  • Males spend less time studying
  • goout correlates positively with alcohol use with high significance
  • studytime correlates negatively with alcohol use with high significance

All in all, there are several hypotheses:

  • Getting a better grade while being male is associated with lower alcohol use
  • Family relationships affect alcohol use
  • Alcohol is used when going out
  • Alcohol is not used when studying

Let’s take a closer look at the quality of family relationships versus alcohol use

library(tidyverse)

alc_var <- function(alc, expl.variable) {
  expl.vars <- c(expl.variable, "alc_use", "sex")
  
  # Count observations for each combination of the selected variables
  res <- 
    alc %>%
      select(one_of(expl.vars)) %>%
      count(!!!sapply(expl.vars, as.symbol))
  
  g <- ggplot(res, aes(x = .data[[expl.variable]], y = alc_use, col = sex)) + 
    geom_point(aes(size = n), alpha = 0.5) + geom_smooth(method = "loess")
  
  return(g)
}

alc_var(alc, "famrel")

With a bit of caution, it appears that men with low-quality family relationships use more alcohol, as there is an increase in alc_use when famrel goes from medium to bad. However, the relationship is roughly linear (without deeper analysis):

summary(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
## 
## Call:
## lm(formula = alc_use ~ famrel, data = alc[alc$sex == "M", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7458 -0.9118 -0.1898  0.8102  3.0882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.30184    0.38377   8.604 4.55e-15 ***
## famrel      -0.27800    0.09368  -2.968  0.00343 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.128 on 173 degrees of freedom
## Multiple R-squared:  0.04844,    Adjusted R-squared:  0.04294 
## F-statistic: 8.807 on 1 and 173 DF,  p-value: 0.003427
# make the plot tighter
op <- par(mfrow=c(2, 2),
          mar = c(4, 4, 2, 1),
          mgp = c(2, 1, 0))

plot(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))

par(op) # return original par

library(car)
# There is no heteroscedasticity by Breusch-Pagan test
car::ncvTest(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.398195, Df = 1, p = 0.52802

Next, plot the other variables as well

library(ggpubr)

g1 <- alc_var(alc, "goout")
g2 <- alc_var(alc, "sex")
g3 <- alc_var(alc, "studytime")
g4 <- alc_var(alc, "G3")

ggarrange(g1, g2, g3, g4,
          ncol = 4, nrow = 1, common.legend = TRUE, 
          legend = "bottom")

When goout increases, so does the alcohol use. Males appear to use more alcohol. As studytime increases, alcohol usage drops. G3 does not have a clear linear pattern with alcohol use. Otherwise, the above hypotheses are OK.

Logistic regression

Now we are ready to fit the logistic regression model with G3, famrel, goout, sex and studytime.

model.1 <- glm(paste0("high_use", " ~ ", paste(interesting_vars[-1], collapse=" + ")),
               family = "binomial", data = alc)

summary(model.1)
## 
## Call:
## glm(formula = paste0("high_use", " ~ ", paste(interesting_vars[-1], 
##     collapse = " + ")), family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6895  -0.7784  -0.4891   0.8155   2.6807  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.85026    0.85893  -0.990  0.32222    
## G3          -0.03667    0.04031  -0.910  0.36294    
## famrel      -0.41604    0.13989  -2.974  0.00294 ** 
## goout        0.77038    0.12402   6.212 5.25e-10 ***
## sexM         0.80171    0.26695   3.003  0.00267 ** 
## studytime   -0.46034    0.17229  -2.672  0.00754 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 369.86  on 364  degrees of freedom
## AIC: 381.86
## 
## Number of Fisher Scoring iterations: 4

We get a model with four significant variables. G3 is not significant, so we remove it and refit.

model.2 <- glm(high_use ~ famrel + goout + sex + studytime,
               family = "binomial", data = alc)

summary(model.2)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + sex + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5891  -0.7629  -0.4976   0.8304   2.6937  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2672     0.7312  -1.733  0.08309 .  
## famrel       -0.4193     0.1399  -2.996  0.00273 ** 
## goout         0.7873     0.1230   6.399 1.57e-10 ***
## sexM          0.7959     0.2669   2.982  0.00286 ** 
## studytime    -0.4814     0.1711  -2.814  0.00490 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 370.69  on 365  degrees of freedom
## AIC: 380.69
## 
## Number of Fisher Scoring iterations: 4

There is a lot of output:

  • Fitted model is \(\log \frac{\Pr(high\_use = 1)}{1 - \Pr(high\_use = 1)} = \log \frac{\Pr(high\_use = 1)}{\Pr(high\_use = 0)} = -1.2672 - 0.4193 \cdot famrel + 0.7873 \cdot goout + 0.7959 \cdot sexM - 0.4814 \cdot studytime\)
  • The coefficients represent a change in the log odds of the response when the explanatory variable increases by one, conditional on the other explanatory variables remaining constant
  • All of the variables (and the non-zero intercept with 0.10 p-value cutoff) are significant by Wald test
  • Deviance is similar in concept to the squared error in ordinary regression and quantifies difference between the observed and predicted proportions/probabilities
  • Residual deviance is the sum of the squared deviance residuals
  • Null deviance is the deviance for null model (high_use ~ 1)
  • AIC is the Akaike information criterion and can be used for model selection
  • Number of Fisher Scoring iterations tells how many iterations were needed to fit the model
# Residual deviance
sum(residuals(model.2, type = "deviance")^2)
## [1] 370.6854
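To make the coefficient interpretation concrete, here is a small sketch that computes the predicted probability for a hypothetical individual (a male with famrel = 4, goout = 3 and studytime = 2), plugging the values into the fitted model above:

```r
# Hypothetical individual: male, famrel = 4, goout = 3, studytime = 2.
# Linear predictor from the coefficients in summary(model.2).
log_odds <- -1.2672 - 0.4193 * 4 + 0.7873 * 3 + 0.7959 - 0.4814 * 2

# Inverse logit turns log odds into a probability
prob <- plogis(log_odds)  # 1 / (1 + exp(-log_odds))
round(prob, 2)            # roughly 0.32
```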

Check the residuals-versus-fitted plot for systematic patterns; a clear pattern would suggest the model is misspecified.

# Plot residuals and check that linearity is OK
plot(model.2, which = 1)  # -> Looks OK.

We can also test with ANOVA whether the null model differs significantly from our model

# see http://homepage.stat.uiowa.edu/~rdecook/stat3200/notes/LogReg_part2_4pp.pdf
# and https://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_logistic_regression_glm.pdf
# and https://stats.stackexchange.com/questions/59879/

# 1-pchisq(model.2$null.deviance-model.2$deviance, model.2$df.null-model.2$df.residual)
null <- glm(high_use ~ 1, family = "binomial", data = alc)
anova(null, model.2, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ 1
## Model 2: high_use ~ famrel + goout + sex + studytime
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       369     452.04                          
## 2       365     370.69  4   81.354 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By this test, we conclude that at least one of the explanatory variables has a non-zero coefficient.
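As a sanity check, the same test can be computed by hand from the reported deviances: the drop in deviance is approximately chi-squared distributed, with degrees of freedom equal to the number of parameters added.

```r
# Deviance drop from the ANOVA table above: 452.04 -> 370.69 with 4 df
p_value <- 1 - pchisq(452.04 - 370.69, df = 369 - 365)
p_value  # essentially zero, matching Pr(>Chi) above
```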

Compute odds ratios and confidence intervals

OR <- coef(model.2) %>% exp
CI <- confint(model.2) %>% exp

list(OR = OR, CI = CI)
## $OR
## (Intercept)      famrel       goout        sexM   studytime 
##   0.2816141   0.6575181   2.1974324   2.2165443   0.6179355 
## 
## $CI
##                  2.5 %    97.5 %
## (Intercept) 0.06545486 1.1622100
## famrel      0.49791884 0.8636137
## goout       1.73873662 2.8198312
## sexM        1.31841735 3.7630180
## studytime   0.43751933 0.8576353

The odds ratio for famrel is 0.66 and the 95% CI is [0.50, 0.86]. Thus, a one-unit increase in famrel is associated with a 14-50% decrease in the odds of high_use.

Variables goout (increases the odds of high_use) and studytime (decreases the odds of high_use) are interpreted analogously; neither 95% CI contains 1.

sexM is interpreted relative to the implicit reference level sexF: being male increases the odds of high_use by around 122%, with a 95% CI of [1.32, 3.76] for the odds ratio.

The results correspond to the hypotheses. famrel seems to merit further work, although the interaction term sex*famrel is not significant when added to model.2 (data not shown).
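The percentage interpretations above follow directly from the odds ratios; a minimal sketch with the estimates printed earlier:

```r
# Percent change in the odds per one-unit increase is (OR - 1) * 100
OR <- c(famrel = 0.6575, goout = 2.1974, sexM = 2.2165, studytime = 0.6179)
round((OR - 1) * 100)  # famrel -34, goout +120, sexM +122, studytime -38
```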

Predictive power

observed <- alc$high_use
predicted <- predict(model.2, type = "response") > 0.5

# 2x2 tabulation, confusion matrix
table(observed, predicted)
##         predicted
## observed FALSE TRUE
##    FALSE   230   29
##    TRUE     59   52

Most cases are classified correctly.
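The confusion matrix can be summarized with a few standard metrics; a sketch using the counts tabulated above:

```r
# Counts from the confusion matrix above
TN <- 230; FP <- 29; FN <- 59; TP <- 52
n  <- TN + FP + FN + TP  # 370 observations

metrics <- c(
  accuracy    = (TP + TN) / n,   # share classified correctly
  sensitivity = TP / (TP + FN),  # share of true high_use detected
  specificity = TN / (TN + FP)   # share of non-high_use detected
)
round(metrics, 2)  # accuracy 0.76, sensitivity 0.47, specificity 0.89
```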

Below are the predictions versus actual high_use values, plotted separately for each explanatory variable.

# Get predictions, compare against true values
predicted_abs <- predict(model.2, type = "response")

df <- data.frame(cbind(observed, predicted, predicted_abs, 
                       alc[, interesting_vars[3:6]]))

df$matches <- df$observed == df$predicted

# Plot by explanatory variables
prob.expl <- function(df, ind.var, expl.var, col.var) {

  g <- ggplot(df, aes(x = .data[[ind.var]], 
                      y = .data[[expl.var]])) + 
    geom_point(alpha = 0.5, aes(col = .data[[col.var]])) + 
    geom_smooth(method = "loess", col = "black", size = 0.5) +
    theme_bw()
  g
}
  
g1 <- prob.expl(df, "predicted_abs", "famrel", "matches")
g2 <- prob.expl(df, "predicted_abs", "goout", "matches")
g3 <- prob.expl(df, "predicted_abs", "sex", "matches")
g4 <- prob.expl(df, "predicted_abs", "studytime", "matches")

ggarrange(g1, g2, g3, g4, ncol = 2, nrow = 2, labels = c("A", "B", "C", "D"),
          common.legend = TRUE, legend = "bottom")

From the plot above we see where the predictions of high_use are right (greenish points) and wrong (red points). For famrel (A), sex (C) and studytime (D), false predictions cluster around probability 0.5. With goout (B), high_use appears to be poorly predicted when goout is low and the predicted value lies in the range [0.5, 0.7].

Quantify at which probability values mismatches tend to occur:

# Predictions equal observation
round(table(cut_interval(df[df$matches == TRUE, c("predicted_abs")], length=0.20)) /
  length(df$matches), 2)
## 
##   [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1] 
##      0.36      0.22      0.08      0.08      0.02
# False predictions
round(table(cut_interval(df[df$matches == FALSE, c("predicted_abs")], length=0.20)) /
  length(df$matches), 2)
## 
##   [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] 
##      0.06      0.05      0.10      0.03

This confirms that most false predictions are made around the probability 0.5.

The training error of the model is 0.24, i.e. fairly low. If we simply guessed based on, say, goout, the error would be around 0.28, which is not much worse. Of course this guess is not random, since we already know that goout is an important predictor.

# Training error, model.2
sum(df$matches == FALSE) / length(df$matches)
## [1] 0.2378378
# Guess high_use by goout
guesses <- rep(FALSE, times = nrow(df))
guesses[df$goout > mean(df$goout)] <- TRUE

table(df$observed, guesses)
##        guesses
##         FALSE TRUE
##   FALSE   198   61
##   TRUE     41   70
sum(guesses != df$observed) / nrow(df)
## [1] 0.2756757

Cross-validation

Follow the code in DataCamp for this one

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, df$predicted_abs)
## [1] 0.2378378
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model.2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2540541

The test set performance is a bit better than the DataCamp baseline, 0.25 versus 0.26.

Simple model selection

# Select all explanatory variables available
training.vars <- colnames(alc)[1:36] # Discard non-joined .m and .p variables

# Discard alcohol consumption variables and technical variables
training.vars <- training.vars[!(
  training.vars %in% c("Dalc", "Walc", "n", "id.p", "id.m"))]

Now, with enough compute we could try every subset of the variables, but that would amount to on the order of 2^32 = 4294967296 models. We instead use stepwise selection by AIC, starting from the full model and progressing backwards. We rely on R to handle the factor-to-dummy (treatment) encoding. (Note the well-known problems of stepwise regression; other methods, e.g. via caret, would be advisable.)
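For illustration, a toy sketch of the backward stepAIC mechanics on the built-in mtcars data (a hypothetical example, not the course data):

```r
library(MASS)

# Backward stepwise AIC: start from a full linear model on mtcars
# and drop terms one at a time while AIC improves
full    <- lm(mpg ~ ., data = mtcars)
reduced <- stepAIC(full, direction = "backward", trace = FALSE)

# The reduced model has fewer terms and an AIC no worse than the full model
length(coef(reduced)) < length(coef(full))
AIC(reduced) <= AIC(full)
```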

library(MASS)

# Run stepwise AIC selection on full model, keep the model and its AIC
final.m <- stepAIC(
  # Full model
  glm(paste0("high_use", " ~ ", paste(training.vars[-1], collapse=" + ")),
             family = binomial, data = alc),
  direction = "backward", trace = FALSE, 
  # What to keep from models
  keep = function(model, aic) { list(model = model, aic = aic) } )

Then perform 10-fold CV for the models

# Init
CVs  <- numeric(ncol(final.m$keep))  # Test error
ERRs <- numeric(ncol(final.m$keep))  # Train error
AICs <- numeric(ncol(final.m$keep))
Nvar <- numeric(ncol(final.m$keep))

for (i in 1:ncol(final.m$keep)) {  # Each column is a model
  
  interim.m <- final.m$keep[, i][1]$model
  CVs[i] <- cv.glm(data = alc, cost = loss_func, glmfit = interim.m, K = 10)$delta[1]
  
  ERRs[i] <- loss_func(alc$high_use, predict(interim.m, type = "response"))
  
  AICs[i] <- final.m$keep[, i][2]$aic
  
  Nvar[i] <- length(final.m$keep[, i][1]$model$coefficients) - 1
}

CV.res <- data.frame(test = CVs, train = ERRs, AIC = AICs, Nvar = Nvar)
# Somewhat unsatisfactory graph with points

g1 <- ggplot(CV.res, aes(x = Nvar, y = test)) + geom_line() + 
  geom_point(aes(x = Nvar, y = test, color = AIC), size = 5, alpha = 0.8) +
  theme_bw() +  xlab("Number of variables") + ylab("Test set error") +
  ggtitle("10-fold CV on backward stepwise variable selection with AIC") +
  scale_color_gradient(low="blue", high="red")

g1

# Better graph with both test and training error, and AIC separately

errors <- gather(CV.res, type, error, c("test", "train"))

g1 <- ggplot(errors, aes(x = Nvar, color = type)) + geom_line(aes(y = error)) + 
  theme_bw() + xlab("Number of variables") + ylab("Prediction error")

g2 <- ggplot(errors, aes(x = Nvar)) + geom_line(aes(y = AIC)) + 
  theme_bw() +  xlab("Number of variables") + ylab("AIC")
  
g <- ggarrange(g1, g2, ncol = 2, nrow = 1, widths = c(2, 1.5),
          common.legend = FALSE, legend = "left")

annotate_figure(g, top = text_grob("10-fold CV on backward stepwise variable selection with AIC"))

The final model was

summary(final.m)
## 
## Call:
## glm(formula = high_use ~ sex + address + famsize + reason + guardian + 
##     activities + romantic + famrel + goout + health + absences, 
##     family = binomial, data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8167  -0.7154  -0.4262   0.5635   2.7612  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.14937    0.86318  -2.490 0.012772 *  
## sexM              0.98275    0.28722   3.422 0.000623 ***
## addressU         -0.89491    0.33661  -2.659 0.007847 ** 
## famsizeLE3        0.47009    0.29971   1.569 0.116759    
## reasonhome        0.31183    0.35612   0.876 0.381229    
## reasonother       1.24517    0.47895   2.600 0.009328 ** 
## reasonreputation -0.28654    0.36969  -0.775 0.438278    
## guardianmother   -0.68303    0.32428  -2.106 0.035179 *  
## guardianother     0.41726    0.71637   0.582 0.560256    
## activitiesyes    -0.51481    0.28383  -1.814 0.069710 .  
## romanticyes      -0.50684    0.30464  -1.664 0.096171 .  
## famrel           -0.50627    0.15580  -3.249 0.001156 ** 
## goout             0.91148    0.13829   6.591 4.36e-11 ***
## health            0.16738    0.10359   1.616 0.106138    
## absences          0.09152    0.02373   3.856 0.000115 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 335.02  on 355  degrees of freedom
## AIC: 365.02
## 
## Number of Fisher Scoring iterations: 5

Removing all non-significant explanatory variables and iterating yields a model with a training error of 0.20 and a 10-fold CV test set error of 0.23:

parsimonious.m <- glm(formula = high_use ~ sex + address + reason + guardian + 
    activities + famrel + goout + absences, family = binomial, data = alc)

summary(parsimonious.m)
## 
## Call:
## glm(formula = high_use ~ sex + address + reason + guardian + 
##     activities + famrel + goout + absences, family = binomial, 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8386  -0.7558  -0.4676   0.6257   2.8133  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.65723    0.77828  -2.129 0.033225 *  
## sexM              1.09542    0.27999   3.912 9.14e-05 ***
## addressU         -0.84765    0.33097  -2.561 0.010435 *  
## reasonhome        0.24841    0.34895   0.712 0.476533    
## reasonother       1.06110    0.46814   2.267 0.023412 *  
## reasonreputation -0.35044    0.36541  -0.959 0.337531    
## guardianmother   -0.65433    0.31871  -2.053 0.040071 *  
## guardianother     0.27975    0.69303   0.404 0.686459    
## activitiesyes    -0.55234    0.28067  -1.968 0.049075 *  
## famrel           -0.45739    0.15011  -3.047 0.002311 ** 
## goout             0.87831    0.13550   6.482 9.06e-11 ***
## absences          0.08739    0.02335   3.743 0.000182 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 341.90  on 358  degrees of freedom
## AIC: 365.9
## 
## Number of Fisher Scoring iterations: 5
loss_func(alc$high_use, predict(parsimonious.m, type = "response"))
## [1] 0.1972973
cv.glm(data = alc, cost = loss_func, glmfit = parsimonious.m, K = 10)$delta[1]
## [1] 0.2297297

Week 4 - Clustering and classification

date()
## [1] "Sat Dec 11 23:11:22 2021"

Introduction

This week, we are analyzing and modeling a dataset from MASS package, “Housing Values in Suburbs of Boston”. References and further description can be found here.

Dataset

rm(list=ls())

library(MASS)
boston <- MASS::Boston

dim(boston)
## [1] 506  14

The data has 506 rows and 14 columns.

str(boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Most of the columns are double valued, chas and rad are integer typed.

table(sapply(boston, typeof))
## 
##  double integer 
##      12       2

A summary of the variables is shown below. crim is the per capita crime rate by town, and tax is the full-value property tax rate. chas is a dummy variable describing location relative to the Charles River (1 if the tract bounds the river). Full descriptions are here.

summary(boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Correlation plot shows many strong correlations (without going into p-values):

library(corrplot)
library(tidyverse)

cor.matrix <- cor(boston)

rownames(cor.matrix) <- colnames(cor.matrix)

corrplot(as.matrix(cor.matrix), method = "ellipse", 
         order = "AOE", number.cex = 0.5,  # Use some nice ordering
         tl.cex = 1, tl.col = "black", addCoef.col = 'black')

We can filter the plot to correlations whose absolute value is above 0.7, showing only significant coefficients. We see that e.g. dis and age correlate negatively, while nox and age correlate positively.

cor.matrix.filtered <- cor(boston)

cor.matrix.filtered <- apply(cor.matrix.filtered, 2, function(x) {
  ifelse(abs(x) > 0.7, x, 0)
})

rownames(cor.matrix.filtered) <- colnames(cor.matrix.filtered)

# Don't show zeroes. Get the order manually to match that above.
ord <- corrMatOrder(cor.matrix, order="AOE")

cor.matrix.filtered <- cor.matrix.filtered[ord, ord]
col <- ifelse(abs(cor.matrix.filtered - 0) < .Machine$double.eps, "white", "black")

# Test for correlation significance.
testRes <- cor.mtest(boston, conf.level = 0.95)

corrplot(as.matrix(cor.matrix.filtered),
         method = "ellipse", order = "original", number.cex = 0.5,
         tl.cex = 1, tl.col = "black",
         p.mat = testRes$p, insig = "blank", na.label = " ", 
         use = "complete.obs", addCoef.col = col)

Plot the variables that correlate highly with some other variable

interesting <- which(
  abs(cor.matrix.filtered) > 0.7 & abs(cor.matrix.filtered) < 1, 
  arr.ind = TRUE)

boston.cor <- boston[, which(
  colnames(boston) %in% rownames(interesting))]

# pairs(boston.cor, col = boston$chas)

library(GGally)

lower <- function(data, mapping, method="loess", ...){
      p <- ggplot(data = data, mapping = mapping) + 
      geom_point() + 
      geom_smooth(method=method, ...)
      p
}

p <- ggpairs(boston.cor, mapping = aes(alpha = 0.3), 
             lower = list(continuous = lower),  
             upper = list(continuous = wrap("cor", size = 2)))

p

We see that e.g. nox and age show a clear non-linear pattern (modeling with lm reveals heteroscedasticity in the residuals versus fitted values). Several variables form two groups of points when plotted against rad or tax.

Then plot the other ones that do not correlate strongly.

boston.cor <- boston[, which(!  # not
  colnames(boston) %in% c(rownames(interesting), "chas"))]

# c("nox", "dis", "tax", "indus", "age", "rad", "medv", "lstat")

# pairs(boston.cor, col = boston$chas)

lower <- function(data, mapping, method="loess", ...){
      p <- ggplot(data = data, mapping = mapping) + 
      geom_point() + 
      geom_smooth(method=method, ...)
      p
}

p <- ggpairs(boston.cor, mapping = aes(alpha = 0.3), 
             lower = list(continuous = lower),  
             upper = list(continuous = wrap("cor", size = 2)))

p

Here we also see some interesting patterns, which are left for future analyses.

Then plot variable distributions, stratify by binary chas

library(reshape2)

boston$chas <- as.factor(boston$chas)

boston.m <- 
  boston %>%
    melt(id.vars = c("chas"))

ggplot(boston.m, aes(value, col = chas)) +
  geom_histogram(aes(fill = chas)) + facet_wrap(~variable, scales = "free")

Most of the distributions are not close to normal, and most observations belong to chas == 0 class. Distributions of indus, tax and rad have two peaks.

Standardization

boston.std <- as.data.frame(
  scale(boston[, which(!(colnames(boston) %in% c("chas")))]))

boston.std$chas <- boston$chas

summary(boston.std)
##       crim                 zn               indus              nox         
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-1.4644  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.9121  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.1441  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.: 0.5981  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv         chas   
##  Min.   :-1.9063   0:471  
##  1st Qu.:-0.5989   1: 35  
##  Median :-0.1449          
##  Mean   : 0.0000          
##  3rd Qu.: 0.2683          
##  Max.   : 2.9865
apply(boston.std, 2, var)
##       crim         zn      indus        nox         rm        age        dis 
## 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 
##        rad        tax    ptratio      black      lstat       medv       chas 
## 1.00000001 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 0.06451297

All variables except the factorized chas are mean-centered and have unit variance. Plot the distributions for clarity:
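As a sanity check, a minimal toy sketch of what scale() guarantees for each column:

```r
# scale() subtracts the column mean and divides by the column sd
x <- c(2, 4, 6, 8)
z <- scale(x)

mean(z)  # 0 (up to floating point)
sd(z)    # 1
```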

boston.m <- 
  boston.std %>%
    melt(id.vars = c("chas"))

ggplot(boston.m, aes(value, col = chas)) +
  geom_histogram(aes(fill = chas)) + facet_wrap(~variable, scales = "free")

Create categorical variable of crime rate by quantiles

bins <- quantile(boston.std$crim)

crime <- cut(boston.std$crim, breaks = bins, 
             include.lowest = TRUE, 
             labels = c("low", "med_low", "med_high", "high"))

boston.std <- dplyr::select(boston.std, -crim)

boston.std$crime <- crime
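A minimal toy sketch of the quantile binning above: quantile() with the default probabilities returns five break points, so cut() produces four classes of roughly equal size.

```r
# Toy data: quartile breaks give four equally sized classes
x <- 1:100
bins <- quantile(x)
classes <- cut(x, breaks = bins, include.lowest = TRUE,
               labels = c("low", "med_low", "med_high", "high"))
table(classes)  # 25 observations in each class
```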

Divide into training and test sets (following the course’s DataCamp exercise)

# standardize the chas as we need later
boston.std$chas <- scale(as.numeric(boston.std$chas))

n <- nrow(boston.std)

set.seed(123)  # for reproducibility
# sample n * 0.8 entries from 1:n
ind <- sample(n,  size = n * 0.8)

train <- boston.std[ind,]
test <- boston.std[-ind,]

# save the correct classes from test data
test.correct.classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

LDA

Fit LDA on all variables as instructed. Note that none of the variable distributions are normal by the Shapiro–Wilk normality test

# sapply(boston.std %>% select(-chas) %>% select(-crime), shapiro.test)

# fit LDA
lda.fit <- lda(crime ~ ., data = train)

# plot biplot (follow course's Datacamp exercise)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", 
                       tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2.25)

From the biplot we see that the high crime class is well separated from the others. rad, zn (and indus and nox) stand out from the other variables, with rad being approximately orthogonal to the rest. The medium-high class is separated fairly well from the low class, while the medium-low class overlaps with both the low and medium-high classes. A few outliers in every class lie far from the center of mass of their group.

# print lda.fit details
# The prior probabilities seem to reflect the distribution in train data, 
# but are not identical
table(train$crime) / nrow(train)
## 
##       low   med_low  med_high      high 
## 0.2549505 0.2500000 0.2500000 0.2450495
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2500000 0.2500000 0.2450495 
## 
## Group means:
##                   zn      indus        nox          rm        age        dis
## low       1.01866313 -0.9066422 -0.8835724  0.38666334 -0.9213895  0.9094324
## med_low  -0.07141687 -0.3429155 -0.5768545 -0.09882533 -0.3254849  0.3694282
## med_high -0.40148591  0.2162741  0.3637157  0.12480815  0.4564260 -0.3720478
## high     -0.48724019  1.0149946  1.0481437 -0.47733231  0.8274496 -0.8601246
##                 rad        tax    ptratio      black        lstat        medv
## low      -0.6819317 -0.7458486 -0.4234280  0.3729083 -0.766883775  0.47009410
## med_low  -0.5395408 -0.4205644 -0.1079710  0.3103406 -0.164921798  0.01548761
## med_high -0.4349280 -0.3191090 -0.2716959  0.1049654  0.009720124  0.19440747
## high      1.6596029  1.5294129  0.8057784 -0.6383964  0.900379309 -0.71294711
##                 chas
## low      -0.08120770
## med_low   0.03952046
## med_high  0.19544522
## high     -0.03371693
## 
## Coefficients of linear discriminants:
##                  LD1         LD2        LD3
## zn       0.148390645  0.74870532 -1.0874785
## indus    0.040971465 -0.38126652 -0.1619456
## nox      0.312458150 -0.67564471 -1.3104018
## rm       0.011086947 -0.16058718 -0.1572603
## age      0.283482486 -0.38817624 -0.1013491
## dis     -0.079848789 -0.38493775  0.2108038
## rad      3.718978412  0.93123532 -0.4706522
## tax     -0.015197127  0.04230505  1.2889075
## ptratio  0.180294774 -0.01592588 -0.3558490
## black   -0.136724112  0.02948075  0.1288959
## lstat    0.145739238 -0.37823065  0.3345688
## medv     0.061327205 -0.53906503 -0.1509890
## chas     0.002460776  0.03963849  0.1699312
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9523 0.0364 0.0113

Prediction with LDA

# predict the classes
test.predictions <- predict(lda.fit, test)

# tabulate the results
table(test.predictions$class, test.correct.classes)
##           test.correct.classes
##            low med_low med_high high
##   low       10       2        1    0
##   med_low   10      17        9    0
##   med_high   4       6       13    1
##   high       0       0        2   27

The high class is predicted well, as expected. The other classes are predicted less accurately: med_high is often predicted as med_low, and low as med_low. Thus, with this model we can only reliably separate the high class from the other cases.
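The cross-tabulation above can be used to quantify this. A quick Python sketch (not part of the R analysis; the counts are copied from the table) computes the overall accuracy on the test set:

```python
# Confusion matrix of the LDA predictions on the test set
# (rows = predicted class, columns = correct class), copied from above.
conf = {
    "low":      [10,  2,  1,  0],
    "med_low":  [10, 17,  9,  0],
    "med_high": [ 4,  6, 13,  1],
    "high":     [ 0,  0,  2, 27],
}
classes = ["low", "med_low", "med_high", "high"]

total = sum(sum(row) for row in conf.values())
correct = sum(conf[c][i] for i, c in enumerate(classes))  # diagonal entries

accuracy = correct / total
print(total, correct, round(accuracy, 3))  # 102 test observations
```

About two thirds of the test cases are classified correctly, and most of the errors come from the three non-high classes.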

K-means

Reload the Boston data and standardize it.

boston <- MASS::Boston

boston <- scale(boston)

# Get pairwise distances 
boston.d <- dist(boston, method = "euclidean")
# Test different number of clusters
ks <- 1:15

# Determine optimal number of cluster by total within cluster 
# sum of squares

# We run k-means on the scaled data and not on the distances, 
# since k-means is not applicable to be run on distance matrix alone, see e.g.
# https://stackoverflow.com/questions/43512808/ and
# https://stats.stackexchange.com/questions/32925/
twcss <- sapply(ks, function(x) { kmeans(boston, centers = x)$tot.withinss})

# Plot
qplot(x = ks, y = twcss, geom = "line") + theme_bw()
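For reference, the total within-cluster sum of squares that kmeans reports as tot.withinss is just the sum of squared distances of each point to its cluster centroid. A minimal pure-Python sketch on hypothetical 1-D data (not the Boston data) shows why it can only decrease as clusters are split:

```python
def twcss(points, assignment, k):
    """Total within-cluster sum of squares for a 1-D clustering."""
    total = 0.0
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        if not members:
            continue
        center = sum(members) / len(members)
        total += sum((p - center) ** 2 for p in members)
    return total

pts = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]          # hypothetical, two obvious groups
one_cluster = twcss(pts, [0] * 6, 1)
two_clusters = twcss(pts, [0, 0, 0, 1, 1, 1], 2)
print(one_cluster, two_clusters)               # splitting lowers TWCSS sharply
```

This is why the elbow heuristic looks for the point where the decrease levels off rather than for a minimum.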

Well, it is difficult to see what the optimal number of clusters is. Three or four is probably OK; two is likely not enough, although that is where the largest drop in TWCSS occurs. Let’s run classical multidimensional scaling (MDS) in 2D on the distance matrix and see whether there are clear clusters.
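For reference, classical MDS (as implemented by cmdscale) works by double-centering the matrix of squared distances and taking the leading eigenvectors. With $D$ the distance matrix and $n$ the number of points:

```latex
B = -\tfrac{1}{2}\, J D^{(2)} J, \qquad J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top},
\qquad X_k = E_k \Lambda_k^{1/2},
```

where $D^{(2)}$ squares the distances elementwise and $E_k$, $\Lambda_k$ collect the $k$ largest eigenvectors and eigenvalues of $B$.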

mds.fit <- cmdscale(boston.d, k = 2)

plot(mds.fit[, 1], mds.fit[, 2], 
     xlab = "Dimension 1", ylab = "Dimension 2")

Well, it seems there are two clusters in the MDS projection.

Let’s run k-medoids (PAM) on the distance matrix and use a silhouette plot this time.

library(cluster)
library(factoextra)
# adapted from https://www.r-bloggers.com/2019/01/10-tips-for-choosing-the-optimal-number-of-clusters/

fviz_nbclust(as.matrix(boston.d), pam, method = "silhouette", k.max = 12) +
  theme_minimal() + ggtitle("Silhouette plot for k-medoids")

With k-medoids, we get two clusters as the optimal.
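The silhouette width behind the plot above is simple to compute by hand: for each point, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and s = (b - a) / max(a, b). A small pure-Python sketch on hypothetical 1-D data (assumes all values are distinct):

```python
def silhouette(points, labels):
    """Mean silhouette width for a 1-D clustering (pure-Python sketch)."""
    clusters = {c: [p for p, l in zip(points, labels) if l == c]
                for c in set(labels)}
    widths = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q != p]   # assumes distinct values
        a = sum(abs(p - q) for q in own) / len(own)
        b = min(sum(abs(p - q) for q in clusters[c]) / len(clusters[c])
                for c in clusters if c != l)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

pts = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]   # hypothetical well-separated groups
print(round(silhouette(pts, [0, 0, 0, 1, 1, 1]), 3))
```

A mean width near 1 indicates tight, well-separated clusters; values near 0 indicate overlap.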

Finally, plot the dendrogram

# Choose average clustering as a "generic" one
hier.clust <-  hclust(boston.d, method = "average")
plot(hier.clust, cex = 0.5)
abline(h = 5.75, col = "red")

Hierarchical clustering suggests three big clusters and one or two smaller ones.

Since we noted with k-means that 3-4 clusters could be good, we decrease the number to 3 based on the short analysis with MDS, k-medoids and hclust.

Let’s visualize the results

km <- kmeans(boston, centers = 3)

cols <- rep("black", nrow(boston))
cols[km$cluster == 2] <- "blue"
cols[km$cluster == 3] <- "red"

# make more compact, hide axes
pairs(boston, col = cols,
      gap = 0, xaxt = "n", yaxt = "n", pch = 3)

The pairs plot does not appear to support the hypothesis of three clusters explainable by any single pair of variables, although many panels clearly show two groups, e.g. those involving the rad variable.

# Visualize k-means result in 2D with principal components 
# (from PCA, shows also the percentage of variance explained).

factoextra::fviz_cluster(km, data = boston,
             #palette = c("#2E9FDF", "#00AFBB", "#E7B800"), 
             geom = "point",
             ellipse.type = "convex", 
             ggtheme = theme_bw())

km2 <- kmeans(boston, centers = 2)

factoextra::fviz_cluster(km2, data = boston,
             #palette = c("#2E9FDF", "#00AFBB", "#E7B800"), 
             geom = "point",
             ellipse.type = "convex", 
             ggtheme = theme_bw())

Two clusters overlap less than three, but the clustering into three groups is not too bad either.

LDA on K-means clusters

Run LDA on k-means clusters (without train-test split).

boston <- as.data.frame(scale(MASS::Boston))

km <- kmeans(boston, centers = 3)

boston$class <- km$cluster
# fit LDA
lda.fit.km <- lda(class ~ ., data = boston)

# plot the lda results
plot(lda.fit.km, dimen = 2, col = boston$class, pch = boston$class) 
lda.arrows(lda.fit.km, myscale = 2.5)

Separation of the groups is similar to what we had before: one group is clearly different, and two overlap. rad is again an important variable for separating one group from the others. This time tax is also similar in direction and magnitude to rad, which we expected from the correlation analysis. age and dis separate clusters 2 and 3 more strongly than the other variables and are more or less orthogonal to rad and tax.

# tabulate how k-means clusters are reproduced
table(predict(lda.fit.km)$class, km$cluster)
##    
##       1   2   3
##   1 133   0   0
##   2   0 152   8
##   3   9  10 194
# Seems to be reproduced pretty well.
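Using the table above, a quick Python check (not part of the R analysis; counts copied from the output) quantifies how well LDA reproduces the k-means clusters:

```python
# Cross-tabulation of LDA-predicted cluster vs. k-means cluster, from above.
conf = [
    [133,   0,   0],
    [  0, 152,   8],
    [  9,  10, 194],
]
total = sum(sum(row) for row in conf)            # all 506 Boston observations
correct = sum(conf[i][i] for i in range(3))      # diagonal agreement
print(total, correct, round(correct / total, 3))
```

Roughly 95% of the observations land in the cluster LDA predicts, so the clusters are close to linearly separable.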

3D plots

Reuse the train set from above, following the course instructions

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)

plot_ly(x = matrix_product$LD1, 
        y = matrix_product$LD2, 
        z = matrix_product$LD3, 
        type = "scatter3d", mode = "markers", 
        color = train$crime)
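As a sanity check on the projection above: multiplying the 404 x 13 predictor matrix by the 13 x 3 scaling matrix yields one coordinate per observation on each of LD1-LD3, i.e. a 404 x 3 matrix. A tiny pure-Python sketch of the same shape logic (hypothetical values, not the Boston data):

```python
def matmul(a, b):
    """Naive matrix product of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

predictors = [[1.0, 2.0, 3.0],        # 2 observations x 3 predictors
              [4.0, 5.0, 6.0]]
scaling = [[1.0, 0.0],                # 3 predictors x 2 discriminants
           [0.0, 1.0],
           [1.0, 1.0]]

proj = matmul(predictors, scaling)    # 2 observations x 2 discriminants
print(len(proj), len(proj[0]))
```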

Plot the same with three k-means clusters

km <- kmeans(train %>% select(-crime), centers = 3)

plot_ly(x = matrix_product$LD1, 
        y = matrix_product$LD2, 
        z = matrix_product$LD3, 
        type = "scatter3d", mode = "markers", 
        color = as.factor(km$cluster))

Now it is more apparent how there might be three clusters. In both plots there is one high crime rate cluster, and the rest of the data points form two clouds: a low-to-med_low cluster and a med_low-to-med_high cluster.

All in all, for prediction tasks, a reliable model can be built to separate the binary classes high and non-high. Splitting non-high into further groups will decrease prediction accuracy.


Week 5 - Dimensionality reduction techniques

date()
## [1] "Sat Dec 11 23:11:52 2021"

Introduction

This week, we are analyzing and modeling a dataset that describes various aspects of human development and gender inequality, stratified by country. The data description is found here and here.

Dataset

Begin by loading the analysis data. Let’s use the “official” csv-file; an R script to produce the wrangled data is in the repo.

rm(list=ls())

# read in the data, convert strings to factors
human <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",
                  header = TRUE, sep = ",", stringsAsFactors = TRUE)

dim(human)
## [1] 155   8
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Life.Exp spans the range [49, 83.50], GNI [581, 123124], the maternal mortality ratio (per 100,000 live births) [1.0, 1100] and the adolescent birth rate (per 1,000 women aged 15-19) [0.60, 204.80].

Plot the scatterplot and correlations between the variables.

library(GGally)
library(ggplot2)

lower <- function(data, mapping, method="loess", ...){
      p <- ggplot(data = data, mapping = mapping) + 
      geom_point() + 
      geom_smooth(method=method, ...)
      p
}

p <- ggpairs(human, mapping = aes(alpha = 0.3), 
             lower = list(continuous = lower),  
             upper = list(continuous = wrap("cor", size = 2))) + 
      theme(text = element_text(size = 8))

# Add color to correlations
# stackoverflow.com/questions/45873483/

# correlations matrix plot
p.corr <- ggcorr(human, label = TRUE, label_round = 3,
                 palette = "RdBu")

# get colors
g2 <- ggplotGrob(p.corr)
colors <- g2$grobs[[6]]$children[[3]]$gp$fill

# Change background color to tiles in the upper triangular matrix of plots 
idx <- 1
for (k1 in 1:(ncol(human) - 1)) {
  for (k2 in (k1 + 1):ncol(human)) {
    plt <- getPlot(p, k1, k2) +
     theme(panel.background = element_rect(fill = colors[idx], color="white"),
           panel.grid.major = element_line(color = colors[idx]))
    p <- putPlot(p, plt, k1, k2)
    idx <- idx + 1
  }
}

p

There are many strong, significant correlations. E.g., Life.Exp correlates positively with Edu.Exp, and Mat.Mor correlates positively with Ado.Birth. Life.Exp and Edu.Exp correlate negatively with Mat.Mor, and Life.Exp and Ado.Birth correlate negatively as well.

Let’s also examine the correlations with corrplot to better see the groups of positive and negative correlations. Only Labo.FM and Parli.F do not correlate strongly with anything.

library(corrplot)

cor.matrix <- cor(human)

rownames(cor.matrix) <- colnames(cor.matrix)

corrplot(as.matrix(cor.matrix), method = "ellipse", 
         order = "AOE", number.cex = 0.5,  # Use some nice ordering
         tl.cex = 1, tl.col = "black", addCoef.col = 'black')

Then plot histograms to assess the distributions.

library(tidyverse)

gather(human) %>% ggplot(aes(value)) + 
  facet_wrap("key", scales = "free") + 
  geom_histogram() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Edu.Exp is close to normally distributed (it passes the Shapiro-Wilk normality test). GNI, Mat.Mor and Ado.Birth are right-skewed; Life.Exp is left-skewed.
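Skewness can also be checked numerically: the sample skewness g1 = m3 / m2^(3/2) (third and second central moments) is positive for right-skewed data and negative for left-skewed data. A pure-Python sketch on hypothetical right-skewed values (not the human data):

```python
def skewness(xs):
    """Sample skewness g1 = m3 / m2**1.5 (biased moment estimator)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical data with a long right tail
print(round(skewness(right_skewed), 2))
```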

PCA

Perform PCA on non-standardized data first (follow code from course’s DataCamp)

human.pca <- prcomp(human)  # SVD

s <- summary(human.pca)

# rounded percentages of variance captured by each PC
pca.pr <- round(100 * s$importance[2,], digits = 1) 

# create object pc_lab to be used as axis labels
pc.lab <- paste0(names(pca.pr), " (", pca.pr, "%)")

# draw a biplot
biplot(human.pca, cex = c(0.6, 0.8), col = c("grey40", "darkred"), 
       xlab = pc.lab[1], ylab = pc.lab[2])
Figure 1. PCA on non-standardized data. GNI's absolute variance is very large and it ends up explaining nearly all the absolute variance.

Well, since the data is not standardized, GNI appears to capture virtually all variability in the data, as it has by far the highest absolute variance. Qatar stands out along the first PC and is opposite to e.g. Chad in GNI.

sort(sapply(human, var))
##      Labo.FM      Edu2.FM      Edu.Exp     Life.Exp      Parli.F    Ado.Birth 
## 3.951293e-02 5.838970e-02 8.067024e+00 6.942328e+01 1.319683e+02 1.690201e+03 
##      Mat.Mor          GNI 
## 4.485483e+04 3.438745e+08
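Using the variances printed above (values copied from the output into a quick Python check, not part of the R analysis), GNI's share of the total variance confirms why it dominates the unscaled PCA:

```python
# Variances of the human variables, copied from the sorted output above.
variances = {
    "Labo.FM": 3.951293e-02, "Edu2.FM": 5.838970e-02, "Edu.Exp": 8.067024e+00,
    "Life.Exp": 6.942328e+01, "Parli.F": 1.319683e+02, "Ado.Birth": 1.690201e+03,
    "Mat.Mor": 4.485483e+04, "GNI": 3.438745e+08,
}
share = variances["GNI"] / sum(variances.values())
print(round(share, 5))  # GNI alone accounts for essentially all the variance
```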

Let’s standardize and replot.

human.pca <- prcomp(human, scale = TRUE)

s <- summary(human.pca)

# rounded percentages of variance captured by each PC
pca.pr <- round(100 * s$importance[2,], digits = 1) 

# create object pc_lab to be used as axis labels
pc.lab <- paste0(names(pca.pr), " (", pca.pr, "%)")

# draw a biplot
biplot(human.pca, cex = c(0.6, 0.8), col = c("grey40", "darkred"), 
       xlab = pc.lab[1], ylab = pc.lab[2])
Figure 2. PCA on standardized data. Life expectancy, expected years of schooling and ratio of female to male populations with secondary education correlate with the first principal component and are of approximately same magnitude. They have an opposite effect to adolescent birth rate and maternal mortality. Ratio of labour force participation of females to males and percent of parliament representation for females are orthogonal to other variables and correlate with the second principal component.

Now the first two principal components explain 69.8% of the variation in the data, mainly due to PC1 (53.6%). Life.Exp, Edu.Exp and Edu2.FM all point along PC1 in the same direction; Mat.Mor and Ado.Birth point in the opposite direction, which is consistent with the correlation analysis. Parli.F and Labo.FM correlate with each other and with PC2; as we calculated before, their correlation is not very strong.

The third principal component explains around 9.6% of the variance and the following components even less. All 8 principal components are needed to fully explain the variance of the data.

summary(human.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
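The cumulative proportions can be recovered from the per-component proportions; a quick Python check (values copied from the summary above; tiny differences such as 0.6985 vs. the printed 0.6984 are rounding in the R output):

```python
# Proportion of variance per PC, copied from summary(human.pca) above.
props = [0.5361, 0.1624, 0.09571, 0.07583, 0.05477, 0.03595, 0.02634, 0.01298]

cumulative = []
running = 0.0
for p in props:
    running += p
    cumulative.append(round(running, 4))
print(cumulative)  # running sums should end at ~1.0
```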

PCA interpretation

Based on the deductions above, we can conclude that PC1 mostly captures the variance in Life.Exp, Edu.Exp and Edu2.FM, as well as in Mat.Mor and Ado.Birth. It explains 53.6% of the total variance in the standardized human data. Mat.Mor and Ado.Birth point, very intuitively, in the opposite direction to education, the education ratio of females to males, and life expectancy.

PC2 captures the variance in Parli.F and Labo.FM, explaining 16.2% of the total variance in the data. These variables point in the same direction along PC2, suggesting that female parliamentary representation and female labor force participation behave similarly.

All in all, the first two PCs are enough to approximately represent the data as the total variance explained is relatively good, 69.8%.

Tea dataset

Load the tea dataset

library(FactoMineR)

data(tea)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
table(sapply(tea, typeof))
## 
## integer 
##      36

The FactoMineR tea data has 300 rows and 36 columns; all of them except age are factors encoded as integers. It is questionnaire data; from the documentation:

“We asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions).”

Visualize the data

tea.cat <- tea %>% select(-age)
  
p <- 
  tea %>% 
    select(-age) %>%
    gather %>% 
    group_by(value) %>% 
    mutate(count = n()) %>%
    ggplot(aes(value)) + geom_point(y = .5, aes(size = count)) + 
      facet_wrap("key", scales = "free", strip.position = "top", ncol = 3) + 
      # facet_grid("key", scales = "free", space = "free") +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1), 
            panel.spacing = unit(2, "lines")) + 
      coord_flip() + 
      theme_bw()

# https://stackoverflow.com/questions/52341385/
p

Plot age separately

ggplot(tea, aes(age)) + geom_histogram() + theme_bw()

MCA

Perform multiple correspondence analysis (with indicator matrix) on the FactoMineR tea data.

mca <- MCA(tea %>% select(-age), graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea %>% select(-age), graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.090   0.082   0.070   0.063   0.056   0.053   0.050
## % of var.              5.838   5.292   4.551   4.057   3.616   3.465   3.272
## Cumulative % of var.   5.838  11.130  15.681  19.738  23.354  26.819  30.091
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.048   0.047   0.044   0.041   0.040   0.039   0.037
## % of var.              3.090   3.053   2.834   2.643   2.623   2.531   2.388
## Cumulative % of var.  33.181  36.234  39.068  41.711  44.334  46.865  49.252
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.036   0.035   0.034   0.032   0.031   0.031   0.030
## % of var.              2.302   2.275   2.172   2.085   2.013   2.011   1.915
## Cumulative % of var.  51.554  53.829  56.000  58.086  60.099  62.110  64.025
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.028   0.027   0.026   0.025   0.025   0.024   0.024
## % of var.              1.847   1.740   1.686   1.638   1.609   1.571   1.524
## Cumulative % of var.  65.872  67.611  69.297  70.935  72.544  74.115  75.639
##                       Dim.29  Dim.30  Dim.31  Dim.32  Dim.33  Dim.34  Dim.35
## Variance               0.023   0.022   0.021   0.020   0.020   0.019   0.019
## % of var.              1.459   1.425   1.378   1.322   1.281   1.241   1.222
## Cumulative % of var.  77.099  78.523  79.901  81.223  82.504  83.745  84.967
##                       Dim.36  Dim.37  Dim.38  Dim.39  Dim.40  Dim.41  Dim.42
## Variance               0.018   0.017   0.017   0.016   0.015   0.015   0.014
## % of var.              1.152   1.092   1.072   1.019   0.993   0.950   0.924
## Cumulative % of var.  86.119  87.211  88.283  89.301  90.294  91.244  92.169
##                       Dim.43  Dim.44  Dim.45  Dim.46  Dim.47  Dim.48  Dim.49
## Variance               0.014   0.013   0.012   0.011   0.011   0.010   0.010
## % of var.              0.891   0.833   0.792   0.729   0.716   0.666   0.660
## Cumulative % of var.  93.060  93.893  94.684  95.414  96.130  96.796  97.456
##                       Dim.50  Dim.51  Dim.52  Dim.53  Dim.54
## Variance               0.009   0.009   0.008   0.007   0.006
## % of var.              0.605   0.584   0.519   0.447   0.390
## Cumulative % of var.  98.060  98.644  99.163  99.610 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             | -0.580  1.246  0.174 |  0.155  0.098  0.012 |  0.052  0.013
## 2             | -0.376  0.522  0.108 |  0.293  0.350  0.066 | -0.164  0.127
## 3             |  0.083  0.026  0.004 | -0.155  0.099  0.015 |  0.122  0.071
## 4             | -0.569  1.196  0.236 | -0.273  0.304  0.054 | -0.019  0.002
## 5             | -0.145  0.078  0.020 | -0.142  0.083  0.019 |  0.002  0.000
## 6             | -0.676  1.693  0.272 | -0.284  0.330  0.048 | -0.021  0.002
## 7             | -0.191  0.135  0.027 |  0.020  0.002  0.000 |  0.141  0.095
## 8             | -0.043  0.007  0.001 |  0.108  0.047  0.009 | -0.089  0.038
## 9             | -0.027  0.003  0.000 |  0.267  0.291  0.049 |  0.341  0.553
## 10            |  0.205  0.155  0.028 |  0.366  0.546  0.089 |  0.281  0.374
##                 cos2  
## 1              0.001 |
## 2              0.021 |
## 3              0.009 |
## 4              0.000 |
## 5              0.000 |
## 6              0.000 |
## 7              0.015 |
## 8              0.006 |
## 9              0.080 |
## 10             0.052 |
## 
## Categories (the 10 first)
##                  Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## breakfast     |  0.182  0.504  0.031  3.022 |  0.020  0.007  0.000  0.330 |
## Not.breakfast | -0.168  0.465  0.031 -3.022 | -0.018  0.006  0.000 -0.330 |
## Not.tea time  | -0.556  4.286  0.240 -8.468 |  0.004  0.000  0.000  0.065 |
## tea time      |  0.431  3.322  0.240  8.468 | -0.003  0.000  0.000 -0.065 |
## evening       |  0.276  0.830  0.040  3.452 | -0.409  2.006  0.087 -5.109 |
## Not.evening   | -0.144  0.434  0.040 -3.452 |  0.214  1.049  0.087  5.109 |
## lunch         |  0.601  1.678  0.062  4.306 | -0.408  0.854  0.029 -2.924 |
## Not.lunch     | -0.103  0.288  0.062 -4.306 |  0.070  0.147  0.029  2.924 |
## dinner        | -1.105  2.709  0.092 -5.240 | -0.081  0.016  0.000 -0.386 |
## Not.dinner    |  0.083  0.204  0.092  5.240 |  0.006  0.001  0.000  0.386 |
##                Dim.3    ctr   cos2 v.test  
## breakfast     -0.107  0.225  0.011 -1.784 |
## Not.breakfast  0.099  0.208  0.011  1.784 |
## Not.tea time   0.062  0.069  0.003  0.950 |
## tea time      -0.048  0.054  0.003 -0.950 |
## evening        0.344  1.653  0.062  4.301 |
## Not.evening   -0.180  0.864  0.062 -4.301 |
## lunch          0.240  0.343  0.010  1.719 |
## Not.lunch     -0.041  0.059  0.010 -1.719 |
## dinner         0.796  1.805  0.048  3.777 |
## Not.dinner    -0.060  0.136  0.048 -3.777 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.031 0.000 0.011 |
## tea.time      | 0.240 0.000 0.003 |
## evening       | 0.040 0.087 0.062 |
## lunch         | 0.062 0.029 0.010 |
## dinner        | 0.092 0.000 0.048 |
## always        | 0.056 0.035 0.007 |
## home          | 0.016 0.002 0.030 |
## work          | 0.075 0.020 0.022 |
## tearoom       | 0.321 0.019 0.031 |
## friends       | 0.186 0.061 0.030 |

The Eigenvalues output contains the variances and the corresponding percentages of variance explained by each dimension.
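The dimension count in the eigenvalue table follows from MCA theory (a standard result, stated here as a cross-check rather than taken from the course): for an indicator-matrix MCA with Q active categorical variables and J categories in total, there are J - Q nontrivial dimensions and the total inertia is (J - Q)/Q. Here Q = 35 active variables (age excluded) and the output shows 54 dimensions, implying J = 89 categories:

```python
q_vars = 35                 # active categorical variables (tea minus age)
n_dims = 54                 # dimensions in the eigenvalue table above
j_cats = n_dims + q_vars    # implied total number of categories

total_inertia = (j_cats - q_vars) / q_vars
print(j_cats, round(total_inertia, 3))
```

The implied mean eigenvalue, 1/Q ≈ 0.029, is consistent with the magnitudes in the Variance row above.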

The Individuals output shows the coordinates, the contribution of each individual as a percentage, and the squared cosine, which measures the degree of association between an individual (or category) and a dimension [1].

Categories are detailed similarly to the individuals. The v.test measures whether a category differs significantly from zero along a dimension (significant if abs(v.test) > 2; the sign gives the direction along the dimension). The Categorical variables section shows the squared correlation ratio (eta2) of each variable with the dimensions.

Before going into interpretation, let’s, for pedagogical reasons, follow the example of a FactoMineR developer and redo the MCA with age as a quantitative supplementary variable and a set of other variables as qualitative supplementary variables.

mca <- MCA(tea, quanti.sup = 19, quali.sup = 20:36, graph = FALSE)  # 19 = "age"
summary(mca)
## 
## Call:
## MCA(X = tea, quanti.sup = 19, quali.sup = 20:36, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.148   0.122   0.090   0.078   0.074   0.071   0.068
## % of var.              9.885   8.103   6.001   5.204   4.917   4.759   4.522
## Cumulative % of var.   9.885  17.988  23.989  29.192  34.109  38.868  43.390
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.065   0.062   0.059   0.057   0.054   0.052   0.049
## % of var.              4.355   4.123   3.902   3.805   3.628   3.462   3.250
## Cumulative % of var.  47.745  51.867  55.769  59.574  63.202  66.664  69.914
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.048   0.047   0.046   0.040   0.038   0.037   0.036
## % of var.              3.221   3.127   3.037   2.683   2.541   2.438   2.378
## Cumulative % of var.  73.135  76.262  79.298  81.982  84.523  86.961  89.339
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27
## Variance               0.035   0.031   0.029   0.027   0.021   0.017
## % of var.              2.323   2.055   1.915   1.821   1.407   1.139
## Cumulative % of var.  91.662  93.717  95.633  97.454  98.861 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             | -0.541  0.658  0.143 | -0.149  0.061  0.011 | -0.306  0.347
## 2             | -0.361  0.293  0.133 | -0.078  0.017  0.006 | -0.633  1.483
## 3             |  0.073  0.012  0.003 | -0.169  0.079  0.018 |  0.246  0.224
## 4             | -0.572  0.735  0.235 |  0.018  0.001  0.000 |  0.203  0.153
## 5             | -0.253  0.144  0.079 | -0.118  0.038  0.017 |  0.006  0.000
## 6             | -0.684  1.053  0.231 |  0.032  0.003  0.001 | -0.018  0.001
## 7             | -0.111  0.027  0.022 | -0.182  0.090  0.059 | -0.207  0.159
## 8             | -0.210  0.099  0.043 | -0.068  0.013  0.004 | -0.421  0.655
## 9             |  0.118  0.031  0.012 |  0.229  0.144  0.044 | -0.538  1.070
## 10            |  0.258  0.150  0.045 |  0.478  0.627  0.156 | -0.482  0.861
##                 cos2  
## 1              0.046 |
## 2              0.409 |
## 3              0.038 |
## 4              0.030 |
## 5              0.000 |
## 6              0.000 |
## 7              0.077 |
## 8              0.174 |
## 9              0.244 |
## 10             0.158 |
## 
## Categories (the 10 first)
##                  Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## breakfast     |  0.166  0.495  0.025  2.756 | -0.166  0.607  0.026 -2.764 |
## Not.breakfast | -0.153  0.457  0.025 -2.756 |  0.154  0.560  0.026  2.764 |
## Not.tea time  | -0.498  4.053  0.192 -7.578 |  0.093  0.174  0.007  1.423 |
## tea time      |  0.386  3.142  0.192  7.578 | -0.072  0.135  0.007 -1.423 |
## evening       |  0.319  1.307  0.053  3.985 | -0.058  0.053  0.002 -0.728 |
## Not.evening   | -0.167  0.683  0.053 -3.985 |  0.030  0.028  0.002  0.728 |
## lunch         |  0.659  2.385  0.075  4.722 | -0.390  1.018  0.026 -2.793 |
## Not.lunch     | -0.113  0.410  0.075 -4.722 |  0.067  0.175  0.026  2.793 |
## dinner        | -0.661  1.146  0.033 -3.136 |  0.796  2.025  0.048  3.774 |
## Not.dinner    |  0.050  0.086  0.033  3.136 | -0.060  0.152  0.048 -3.774 |
##                Dim.3    ctr   cos2 v.test  
## breakfast     -0.483  6.900  0.215 -8.017 |
## Not.breakfast  0.445  6.369  0.215  8.017 |
## Not.tea time   0.265  1.886  0.054  4.027 |
## tea time      -0.205  1.462  0.054 -4.027 |
## evening        0.451  4.312  0.106  5.640 |
## Not.evening   -0.236  2.254  0.106 -5.640 |
## lunch          0.301  0.822  0.016  2.160 |
## Not.lunch     -0.052  0.141  0.016 -2.160 |
## dinner         0.535  1.235  0.022  2.537 |
## Not.dinner    -0.040  0.093  0.022 -2.537 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.025 0.026 0.215 |
## tea.time      | 0.192 0.007 0.054 |
## evening       | 0.053 0.002 0.106 |
## lunch         | 0.075 0.026 0.016 |
## dinner        | 0.033 0.048 0.022 |
## always        | 0.045 0.001 0.101 |
## home          | 0.005 0.000 0.134 |
## work          | 0.112 0.043 0.005 |
## tearoom       | 0.372 0.022 0.008 |
## friends       | 0.243 0.015 0.103 |
## 
## Supplementary categories (the 10 first)
##                  Dim.1   cos2 v.test    Dim.2   cos2 v.test    Dim.3   cos2
## F             |  0.151  0.033  3.158 | -0.109  0.017 -2.278 | -0.048  0.003
## M             | -0.221  0.033 -3.158 |  0.159  0.017  2.278 |  0.070  0.003
## employee      | -0.153  0.006 -1.313 | -0.151  0.006 -1.289 |  0.103  0.003
## middle        | -0.030  0.000 -0.205 |  0.336  0.017  2.281 | -0.284  0.012
## non-worker    | -0.036  0.000 -0.324 |  0.185  0.009  1.666 | -0.291  0.023
## other worker  |  0.040  0.000  0.187 |  0.013  0.000  0.061 | -0.063  0.000
## senior        |  0.415  0.023  2.608 |  0.072  0.001  0.452 | -0.187  0.005
## student       |  0.032  0.000  0.305 | -0.317  0.031 -3.022 |  0.394  0.047
## workman       | -0.417  0.007 -1.473 |  0.249  0.003  0.878 |  0.343  0.005
## Not.sportsman | -0.030  0.001 -0.426 |  0.018  0.000  0.260 | -0.051  0.002
##               v.test  
## F             -0.998 |
## M              0.998 |
## employee       0.884 |
## middle        -1.928 |
## non-worker    -2.620 |
## other worker  -0.289 |
## senior        -1.177 |
## student        3.760 |
## workman        1.209 |
## Not.sportsman -0.721 |
## 
## Supplementary categorical variables (eta2)
##                    Dim.1 Dim.2 Dim.3  
## sex              | 0.033 0.017 0.003 |
## SPC              | 0.032 0.053 0.076 |
## Sport            | 0.001 0.000 0.002 |
## age_Q            | 0.008 0.077 0.146 |
## frequency        | 0.094 0.006 0.064 |
## escape.exoticism | 0.000 0.007 0.000 |
## spirituality     | 0.005 0.000 0.016 |
## healthy          | 0.000 0.000 0.008 |
## diuretic         | 0.004 0.000 0.013 |
## friendliness     | 0.071 0.001 0.013 |
## 
## Supplementary continuous variable
##                  Dim.1    Dim.2    Dim.3  
## age           |  0.042 |  0.204 | -0.340 |

The summary output now also contains the supplementary categories; these have no contributions, since they are not used to build the explanatory dimensions.

Plot first the quantitative supplementary age variable. It seems age does not correlate strongly with the first two dimensions.

plot.MCA(mca, choix = "quanti.sup")
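The same information can be read numerically from the MCA object; a minimal sketch, assuming the FactoMineR `mca` object fitted earlier:

```r
# Coordinates of the supplementary continuous variable (age) on each
# dimension; these match the "Supplementary continuous variable" rows
# of the summary above (0.042, 0.204, -0.340).
mca$quanti.sup$coord
```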

Then plot the variable graph of the correlation ratios. The variables where, tearoom and how correlate with the first dimension more than the other variables; similarly, where, price and how correlate with the second dimension more than the others. Active variables are shown in red, supplementary categorical variables in green, and the supplementary continuous variable age in blue.

plot.MCA(mca, choix = "var", xlim = c(0, 0.75), ylim = c(0, 0.75))

Then plot the graph of the individuals and the category values. The plot is quite busy; it appears that the values tea shop (where), unpackaged (how) and p_upscale (price) are important for dimension 2, while other (how), tearoom (tearoom) and chain store+tea shop (where) are important for dimension 1. The other data points cluster towards the middle.

plot.MCA(mca, choix = "ind")

Next, plot only the individuals in one panel and the active categories in the other. Show only the 50 individuals and the 15 category values that contribute most to the dimensions, and color the individuals by the where variable.

For the individuals, there appears to be a denser region in the third quadrant. The individuals contributing the most lie on the periphery of the point cloud. The individuals are separated nicely by where they get their tea (where). This is not too surprising, as that variable was found to be important in the earlier plots.

Conclusions for the active categories are as before.

op <- par(mfcol=c(1, 2),
          mar = c(4, 4, 2, 1),
          mgp = c(2, 1, 0))

plot(mca, invisible = c("var", "quali.sup", "quanti.sup"), 
     title = "Individuals", select = "contrib 15", habillage = "where")
plot(mca, invisible = c("ind", "quali.sup", "quanti.sup"), 
     title = "Active categories", autoLab = "yes", cex = 0.5,
     selectMod = "contrib 15")

par(op) # return original par


Week 6 - Analysis of longitudinal data

date()
## [1] "Sat Dec 11 23:12:12 2021"

Introduction

This week, we are analyzing and modeling two longitudinal datasets: measurements on the brief psychiatric rating scale (BPRS) ([1] as referred to in [2]) and rat weights by diet (RATS) ([3] as referred to in [2]). The idea is to implement the analyses of Chapter 8 of [2] for the RATS data and of Chapter 9 of [2] for the BPRS data.

[1] Davis, C. S. (2002). Statistical Methods for the Analysis of Repeated Measurements. Springer, New York.

[2] Vehkalahti K. and Everitt B. S. (2019). Multivariate Analysis for the Behavioral Sciences, Second Edition.

[3] Crowder, M. J. and Hand, D. J. (1990). Analysis of Repeated Measurements. Chapman and Hall, London.

RATS

Begin by loading the analysis data in long form.

rm(list=ls())

# read in the data (factors are converted below)
RATSL <- read.csv("./data/ratsl.csv", header = TRUE, sep = ",", row.names = 1)

dim(RATSL)
## [1] 176   4
str(RATSL)
## 'data.frame':    176 obs. of  4 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
# convert to factor
RATSL$ID <- as.factor(RATSL$ID)
RATSL$Group <- as.factor(RATSL$Group)
summary(RATSL)
##        ID      Group      Weight           Time      
##  1      : 11   1:88   Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Median :344.5   Median :36.00  
##  4      : 11          Mean   :384.5   Mean   :33.55  
##  5      : 11          3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11          Max.   :628.0   Max.   :64.00  
##  (Other):110

Weights vary from 225 to 628 g, and Time spans days 1 to 64.

Growth profiles

First, plot the individual response profiles by Group, coloring the lines by ID to avoid recycled linetypes.

library(ggplot2)

p <- ggplot(RATSL, aes(x = Time, y = Weight)) + 
  geom_line(aes(color = ID)) + facet_wrap(~Group, labeller = label_both) + 
  ylab("Weight (g)") + xlab("Day") + theme_bw()

p

It seems that the first group had lower rat weights. In the second group there is an outlier weighing over 550 g on the first measurement day, and the third group weighs around 500 g in the beginning of the study. With this graph we can track the weight response of the rats by groups. To focus on the variability, let’s standardize the data and replot.

library(tidyverse)

# Standardize weight
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight))/sd(Weight) ) %>%
  ungroup()

p <- ggplot(RATSL, aes(x = Time, y = stdweight)) + 
  geom_line(aes(color = ID)) + facet_wrap(~Group, labeller = label_both) + 
  ylab("Standardized weight") + xlab("Day") + theme_bw()

p

Now, for each measurement day, the weights are standardized to zero mean and unit variance, which makes it easier to see the weight differences by ID.

Next, plot the mean response profiles for the three groups, with standard errors.

# Count of measurement days
n <- RATSL$Time %>% unique() %>% length()

# Mean and standard error of weight by Group and Time 
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight) / sqrt(n)) %>%
  ungroup()

# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() + geom_point() +
  scale_shape_manual(values = c(1, 2, 4)) + # nicer shapes
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
  ylab("mean(Weight) +/- se(Weight)") + xlab("Day") +
  theme_bw() +
  theme(legend.position = c(0.9, 0.4))

Group 1 is clearly lighter than others. Groups 2 and 3 are quite close in weight (without overlapping). There is more variation in group 2 due to the outlier.

Alternatively, plot boxplot to display variance between groups by time.

p <- ggplot(RATSL, aes(x = factor(Time), y = Weight, fill = Group)) + 
  geom_boxplot() +
  ylab("Weight (g)") + xlab("Day") +
  theme_bw() +
  theme(legend.position = c(0.9, 0.4))

p

For this data, the previous plot was clearer, as the boxes here are far apart on the y axis and difficult to follow.

Summary measures

RATS is growth data. Choose the mean as the summary measure and plot it (linear regression is performed later).

# Filter out the first day, group by Group and ID, summarize by group
RATSLD <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(Weight)) %>%
  ungroup()

# Boxplot
p <- ggplot(RATSLD, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
  ylab("mean(Weight), days 8-64") +
  theme_bw()

p

The boxplot shows an outlier in each group, and groups 2 and 3 are somewhat skewed. We will not systematically remove outliers based on this plot, because that would discard almost 20% of the data (3 / 16 rats). The outlier in group 2 could arguably be removed, although even that would remove 25% of that group's data. Should we want to remove it, the code would be

outID <- RATSL %>% filter(Time == 64 & Weight > 600) %>% select(ID)

RATSLout <- RATSL %>%
  filter(ID != outID$ID)

# Verify with a plot
ggplot(RATSLout, aes(x = Time, y = stdweight)) + 
  geom_line(aes(color = ID)) + facet_wrap(~Group, labeller = label_both) + 
  ylab("Standardized weight") + xlab("Day") + theme_bw()

Work with the full data (no outliers removed) and perform an ANOVA and t-tests to assess whether the group means differ. (Following the guide here.)

library(rstatix)

aov.rats <- RATSL %>% anova_test(Weight ~ Group)
aov.rats 
## ANOVA Table (type II tests)
## 
##   Effect DFn DFd       F        p p<.05  ges
## 1  Group   2 173 998.175 9.97e-96     * 0.92

Clearly, there are differences between groups.

# Perform pairwise t-tests, adjust with Bonferroni
# Assume equal variance, but see
# https://stats.stackexchange.com/questions/305/

pairwise.ts <- RATSL %>%
  pairwise_t_test(Weight ~ Group, p.adjust.method = "bonferroni", detailed = FALSE, 
                  pool.sd = FALSE, var.equal = TRUE)  # pool.sd == FALSE -> normal t-tests

pairwise.ts
## # A tibble: 3 x 10
##   .y.    group1 group2    n1    n2 statistic    df         p     p.adj
## * <chr>  <chr>  <chr>  <int> <int>     <dbl> <dbl>     <dbl>     <dbl>
## 1 Weight 1      2         88    44    -30.5    130 4   e- 61 1.2 e- 60
## 2 Weight 1      3         88    44    -79.0    130 9.6 e-112 2.88e-111
## 3 Weight 2      3         44    44     -3.91    86 1.86e-  4 5.58e-  4
## # … with 1 more variable: p.adj.signif <chr>

All group differences appear significant. Add a box plot of the full (not mean) data with the significance annotations.

library(ggpubr)

# Add positions for significance indicators
pairwise.ts <- pairwise.ts %>% add_xy_position(x = "Group")

ggboxplot(RATSL, x = "Group", y = "Weight") +
  stat_pvalue_manual(pairwise.ts, hide.ns = TRUE, label = "p.adj.signif") +
  labs(
    subtitle = get_test_label(aov.rats, detailed = TRUE),
    caption = get_pwc_label(pairwise.ts)
    )

Eta squared is high: 92% of the variance in weight is explained by group.
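The generalized eta squared printed by `anova_test` can be cross-checked from the sums of squares; for this one-way design it coincides with the classical eta squared. A sketch, assuming the `RATSL` data from above:

```r
# eta^2 = SS_between / SS_total for the one-way ANOVA on Weight
fit.aov <- aov(Weight ~ Group, data = RATSL)
ss <- summary(fit.aov)[[1]][["Sum Sq"]]
ss[1] / sum(ss)  # ~0.92, matching the ges column above
```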

Also perform the pairwise t-tests on the group means:

RATSLD %>%
  pairwise_t_test(mean ~ Group, p.adjust.method = "bonferroni", pool.sd = FALSE, var.equal = TRUE)
## # A tibble: 3 x 10
##   .y.   group1 group2    n1    n2 statistic    df        p    p.adj p.adj.signif
## * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>    <dbl>    <dbl> <chr>       
## 1 mean  1      2          8     4     -9.06    10 3.88e- 6 1.16e- 5 ****        
## 2 mean  1      3          8     4    -27.8     10 8.34e-11 2.5 e-10 ****        
## 3 mean  2      3          4     4     -1.07     6 3.26e- 1 9.78e- 1 ns

The difference in mean is not significant between groups 2 and 3.

Finally, incorporate the first day as baseline to use in the covariate analysis.

RATSW <- RATSL %>% 
  select(-stdweight) %>% 
  pivot_wider(names_from = Time, values_from = Weight, names_prefix = "week") 

RATSLD <- RATSLD %>%
  mutate(baseline = RATSW$`week1`)

# Fit the linear model with the mean as the response 
RATSW_lm <- lm(mean ~ baseline + Group, data = RATSLD)

# Compute the analysis of variance table for the fitted model
anova(RATSW_lm)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The baseline measurement has a strongly significant effect on the mean weight. There does not seem to be a significant difference between groups. The diagnostic plots below suggest, however, possible heteroscedasticity as well as outlier problems (by Cook's distance).

# make the plot tighter
op <- par(mfrow=c(2, 2),
          mar = c(4, 4, 2, 1),
          mgp = c(2, 1, 0))

plot(RATSW_lm)

par(op)

Check also the simple linear regression to compare the groups without the baseline:

RATS_lm <- lm(Weight ~ Group, data = RATSL)
summary(RATS_lm)
## 
## Call:
## lm(formula = Weight ~ Group, data = RATSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.705 -19.026   1.284  11.287 143.295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  263.716      3.850   68.50   <2e-16 ***
## Group2       220.989      6.668   33.14   <2e-16 ***
## Group3       262.080      6.668   39.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.12 on 173 degrees of freedom
## Multiple R-squared:  0.9203, Adjusted R-squared:  0.9193 
## F-statistic: 998.2 on 2 and 173 DF,  p-value: < 2.2e-16

There are differences between the groups relative to group 1, but the linear model is again not terribly good:

# make the plot tighter
op <- par(mfrow=c(2, 2),
          mar = c(4, 4, 2, 1),
          mgp = c(2, 1, 0))

plot(RATS_lm)

par(op)

BPRS

Begin by loading the analysis data in long form.

# read in the data (factors are converted below)
BPRSL <- read.csv("./data/bprsl.csv", header = TRUE, sep = ",", row.names = 1)

dim(BPRSL)
## [1] 360   4
str(BPRSL)
## 'data.frame':    360 obs. of  4 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
# convert to factor
BPRSL$subject <- as.factor(BPRSL$subject)
BPRSL$treatment <- as.factor(BPRSL$treatment)
summary(BPRSL)
##  treatment    subject         week        bprs      
##  1:180     1      : 18   Min.   :0   Min.   :18.00  
##  2:180     2      : 18   1st Qu.:2   1st Qu.:27.00  
##            3      : 18   Median :4   Median :35.00  
##            4      : 18   Mean   :4   Mean   :37.66  
##            5      : 18   3rd Qu.:6   3rd Qu.:43.00  
##            6      : 18   Max.   :8   Max.   :95.00  
##            (Other):252
unique(BPRSL$subject)
##  [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

The bprs values span from 18 to 95; there are 9 measurement weeks (0–8), 2 treatments, and 20 subject labels (each label occurs in both treatment groups).

BPRS profiles

Plot the profiles with text labels by treatment.

p <- ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_text(aes(label = treatment, color = treatment)) +
  xlab("week") + 
  ylab("BPRS") + 
  theme_bw() +
  theme(legend.position = "none")

p

The plot is messy as the treatments overlap, even with colors. Plot with lines instead, and panel by treatment to increase clarity.

p <- ggplot(BPRSL, aes(x = week, y = bprs, color = subject)) +
  geom_line() +
  theme_bw() +
  theme(legend.position = "top") +
  facet_wrap(~treatment, labeller = label_both)

p

The plot is not terribly good either; it is hard to track the subjects and even harder to see the difference by treatment on a subject level. In treatment 2, there is an outlier, but we keep it, as the corresponding subject is not an outlier in the other treatment group.

Let’s plot a mean aggregate profile, and then fit some models to learn more about the data.

# Count of measurement days
n <- BPRSL$week %>% unique() %>% length()

# Mean and standard error of bprs by treatment and week
BPRSS <- BPRSL %>%
  group_by(treatment, week) %>%
  summarise(mean = mean(bprs), se = sd(bprs) / sqrt(n)) %>%
  ungroup()

# Plot the mean profiles
ggplot(BPRSS, aes(x = week, y = mean, linetype = treatment, shape = treatment)) +
  geom_line() + geom_point() +
  scale_shape_manual(values = c(1, 2, 4)) + # nicer shapes
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
  ylab("mean(bprs) +/- se(bprs)") + xlab("week") +
  theme_bw() +
  theme(legend.position = c(0.9, 0.8))

As suspected, there does not seem to be a big difference by treatment on any particular week.

Finally, plot a box plot grouped by treatment and subject, ignoring the baseline week.

# from Datacamp exercise
BPRSL8S <- BPRSL %>%
  filter(week > 0) %>%
  group_by(treatment, subject) %>%
  summarise(mean = mean(bprs)) %>%
  ungroup()

ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 4, fill = "white") +
  ylab("mean(bprs)") +
  theme_bw()

There is an outlier in treatment group 2, as seen before in the line plot.

Simple linear regression

Fit a linear regression model

BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)

summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Plot a quick overview of the residuals to see if a linear regression is appropriate.

# make the plot tighter
op <- par(mfrow=c(2, 2),
          mar = c(4, 4, 2, 1),
          mgp = c(2, 1, 0))

plot(BPRS_reg)

par(op) # return original par

Linear fit seems OK.

Week is significant, as is the non-zero intercept. However, treatment, conditional on week, does not appear significant. Maybe our model is not good enough to capture the treatment difference.

Plot scatterplot matrix of repeated measures to assess if there is a pattern between weeks.

# to wide
BPRSW <- BPRSL %>% 
  pivot_wider(names_from = week, values_from = bprs, names_prefix = "week") %>%
  select(-c(treatment, subject)) 

# all columns are week measurements (treatment and subject were dropped above)
pairs(BPRSW,
      gap = 0, xaxt = "n", yaxt = "n", pch = 3)

There appears to be autocorrelation, which decreases as the weeks get further apart. It makes sense to capture this autocorrelation with better models that take into account the repeated-measures nature of the data. To that end, fit linear mixed models.

Before fitting the models, plot the correlations to get the numerical values.

library(corrplot)

cor.matrix <- cor(BPRSW)

rownames(cor.matrix) <- colnames(cor.matrix)

corrplot(as.matrix(cor.matrix), method = "ellipse", 
         order = "original", number.cex = 0.5,  # keep the original week order
         tl.cex = 1, tl.col = "black", addCoef.col = 'black')

Linear mixed models

Random intercept

First create a random intercept model.

library(lme4)

# optimize log-likelihood
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

The random effect variances for the subject and the residuals appear reasonable: the subject random-effect variance is not too large, and the residual variance is smaller than in the simple baseline model.

Regression parameters for week and treatment appear identical to the baseline. The standard errors for both week and treatment2 are a bit smaller, likely reflecting incorporation of the within-subject dependence (unlike in the textbook example, here the treatment variable does not have a larger variance compared to the baseline model; this is perhaps due to larger group sizes than in the RATS data).

There is some negative correlation of the regression coefficients with the intercept (see here).
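The within-subject dependence implied by the random intercept can be summarized as an intraclass correlation; a quick computation using the variance estimates printed above:

```r
# ICC = subject variance / (subject variance + residual variance)
var_subject <- 47.41   # subject (Intercept) variance from the summary
var_resid   <- 104.21  # residual variance from the summary
var_subject / (var_subject + var_resid)  # ~0.31
```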

Compare to the baseline model using AIC:

# https://stackoverflow.com/questions/24019807/

AIC(BPRS_reg, BPRS_ref)
##          df      AIC
## BPRS_reg  4 2837.337
## BPRS_ref  5 2748.712

The mixed effects model is better. It would be informative to get the p-values as well:

# With the caveats in
# https://stats.stackexchange.com/questions/22988/

library(lmerTest)
fit <- lmerTest::lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(fit)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     1.9090  37.2392  24.334   <2e-16 ***
## week         -2.2704     0.2084 340.0000 -10.896   <2e-16 ***
## treatment2    0.5722     1.0761 340.0000   0.532    0.595    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000
# or with anova
anova(fit)
## Type III Analysis of Variance Table with Satterthwaite's method
##            Sum Sq Mean Sq NumDF DenDF  F value Pr(>F)    
## week      12371.5 12371.5     1   340 118.7136 <2e-16 ***
## treatment    29.5    29.5     1   340   0.2828 0.5952    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Well, there is more output but the conclusion is the same. week has a significant effect on bprs while treatment group does not. Let’s use lmerTest::lmer from now on to print the p-values as well.

Next, add a subject-specific slope.

Random intercept and slope model

BPRS_ref1 <- lmerTest::lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     2.1052  22.6795  22.066  < 2e-16 ***
## week         -2.2704     0.2977  19.9991  -7.626 2.42e-07 ***
## treatment2    0.5722     1.0405 320.0005   0.550    0.583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

We reach the same conclusions as with the random intercept model. The residual variance is lower, but the subject variance increased. We now also have an estimate of the variance of the week random effect, and there is a negative correlation between the random intercept and slope (see here).

AIC(BPRS_ref, BPRS_ref1)
##           df      AIC
## BPRS_ref   5 2748.712
## BPRS_ref1  7 2745.440
anova(BPRS_ref, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# or
# library(lmtest)
# lrtest(BPRS_ref, BPRS_ref1)

The likelihood ratio test and AIC suggest that the random intercept and slope model is better.

Random intercept and slope model with interaction term

Lastly, add an interaction term between week and treatment.

BPRS_ref2 <- lmerTest::lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)      47.8856     2.2521  29.6312  21.262  < 2e-16 ***
## week             -2.6283     0.3589  41.7201  -7.323 5.24e-09 ***
## treatment2       -2.2911     1.9090 319.9977  -1.200   0.2310    
## week:treatment2   0.7158     0.4010 319.9977   1.785   0.0752 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
# confint(BPRS_ref2, 'week:treatment2', level=0.95)

Interestingly, week and treatment group have a weakly significant interaction, suggesting that the bprs slope is on average higher by 0.72 in treatment group 2. On closer inspection, however, the 95% CI [-0.07, 1.50] contains zero, and the effect is lost.
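The quoted interval can be obtained with `confint` on the fitted model (the profile method is the `lme4` default and can take a moment to compute):

```r
# Profile 95% confidence interval for the interaction coefficient;
# it spans roughly -0.07 to 1.50 and thus covers zero.
confint(BPRS_ref2, parm = "week:treatment2", level = 0.95)
```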

AIC(BPRS_ref1, BPRS_ref2)
##           df      AIC
## BPRS_ref1  7 2745.440
## BPRS_ref2  8 2744.269
anova(BPRS_ref1, BPRS_ref2)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By the likelihood ratio test, the more complicated interaction model does not fit the data significantly better (unless the p-value is divided by 2, as in the textbook, page 179), nor is the AIC much different.
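The halved p-value mentioned above can be computed directly from the chi-squared statistic in the anova table; a sketch using the printed values (halving is the textbook's convention for this kind of comparison):

```r
# LRT p-value for Chisq = 3.1712 on 1 df, and its halved version
pchisq(3.1712, df = 1, lower.tail = FALSE)      # ~0.075, as in the anova table
pchisq(3.1712, df = 1, lower.tail = FALSE) / 2  # ~0.037
```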

Plot the fitted values. We see that indeed both the slopes and the intercepts vary.

library(ggpubr)

# Create a new fitted column in BPRSL
BPRSLF <- BPRSL %>%
  mutate(fitted = fitted(BPRS_ref2))

# BPRSL data
g1 <- ggplot(BPRSLF, aes(x = week, y = bprs, color = subject)) +
  geom_line(aes(linetype = treatment)) +
  theme(legend.position = "top")

# Fitted values with a mixed effects model
g2 <- ggplot(BPRSLF, aes(x = week, y = fitted, color = subject)) +
  geom_line(aes(linetype = treatment)) +
  theme(legend.position = "top")

ggarrange(g1, g2,
          ncol = 2, nrow = 1, common.legend = TRUE, 
          legend = "top")

There is a funnel shape in the residuals versus fitted values, although the residuals appear approximately normally distributed. The funnel is likely because bprs itself is not normally distributed, and indeed a log transform seems to help. We leave it at that, but note that more work would be required to fully diagnose the models.

plot(BPRS_ref2)
qqnorm(resid(BPRS_ref2))

# transform
BPRSLOG <- BPRSL
BPRSLOG$bprs <- log10(BPRSLOG$bprs)
BPRS_ref3 <- lmerTest::lmer(bprs ~ week * treatment + (week | subject), data = BPRSLOG, REML = FALSE)
plot(BPRS_ref3)
qqnorm(resid(BPRS_ref3))


(Course chapters completed.)